Abstract

We wish to investigate compact representations of object programs; to that end we
measure entropy, the average information content of programs. This number tells how many bits,
on the average, would be needed to represent a program in the best possible encoding. A
collection of 114 Mesa programs, comprising approximately a million characters of source text, 
is analyzed. For analysis purposes, the programs are represented by trees, obtained by taking 
the parse trees from the compiler before the code generation pass and merging some of the 
symbol table information into them. 

A new definition is given for a Markov source where the concept of "previous" is defined 
in terms of the tree structure, and this definition is used to model the Mesa program source. 
The lowest entropy value for these Markov models is 1.7 bits per tree node, assuming 
dependencies of each node on its grandfather, father, and elder brother (order 3). These 
numbers compare with an approximate 10 bits per node required for a naive encoding, and an 
equivalent of 3.2 bits per node of code generated by the existing compiler. Motivated by sample 
set limitations for higher order models, we derive an entropy formula in which the order is
non-uniform.

The non-uniform entropy formulas are particularly suited to trees, where we can now 
speak of conditional probabilities in terms of patterns, or arbitrarily shaped contexts around a 
node. A method called pattern refinement is presented whereby patterns are "grown", i.e., the 
set of nodes matching an existing pattern is divided into those matching a larger pattern and 
those remaining. A proof is given that the process always leads to a lower estimate unless the 
old and new patterns induce exactly the same conditional probabilities. The result of applying 
this technique to the sample was an estimate of 1.6 bits per node. Further application would 
reduce this number even more. 

Analytic solutions for the error bounds in approximating the entropy of a Markov source 
are very difficult to obtain, so an experimental approach is used to gauge a confidence figure for 
the estimate. These calculations suggest that a more accurate estimate would be 1.8 bits per 
node, with a standard deviation of 13%. This corresponds to an entropy of 0.54 bits per character
of source program. 

The methods of this thesis can be used both to define a bound for code compression and 
to evaluate existing object code. 
1. Introduction

1.1. Measuring Program Entropy 

The designer of a system should have a good model of how it will be used. For 
example, much effort in early compilers went into optimizing program constructs that 
seldom, if ever, occurred in real programs. The modern trend is toward finding 
techniques that lessen the likelihood of such misdirected effort. A statistical study of 
the programs that are actually written in a language is a valuable tool for a compiler 
designer. Its worth can be manifested in several ways: in a more efficient compiler, in 
a faster program, or in a more compact object program representation. 

This thesis deals with estimating the entropy, or average information content of 
programs. Such a number will tell a compiler designer just how many bits of 
information an average program contains, i.e., how small the object program may be and 
still contain the full program specification. These numbers are usually given in terms of 
some easily computed dimension of the source programs, for example, length. Such 
knowledge of the typical source program can help the designer to design compact object 
programs, and can also provide a theoretical framework against which to measure 
progress. 

In recent years, there has been a dramatic reduction in the cost of computer
hardware, sparked by advances in integrated circuit technology. The current trend is 
away from large time-shared computers and toward small single user machines. One 
component that still is a major cost item in computers is memory. Those of us 
accustomed to computers having tens of millions of bits of main storage must learn to 
"think small." Since most interesting computer tasks require large amounts of memory, 
it is not uncommon to have a virtual memory, in which some of the information treated 
as main memory is actually residing on secondary storage, such as a disk. In these cases, 
the running time of programs is largely influenced by the size of the working set, that 
portion of the virtual memory that is being actively referenced. Anything that reduces
the size of programs can thereby also increase the speed. 

Working set considerations aside, we often are in a position to trade execution 
speed for program size, and vice versa. One longstanding technique used by 
programmers to reduce size is the interpretation of instructions specially designed for a 
given task. This is sometimes thought of as "table driven programming", where a single 
concise table entry can cause a specialized interpreter to execute a large number of 
instructions. One does, nevertheless, pay a price; running programs interpretively is 
typically slower than running programs written directly in the machine code. However, 
exactly what is meant by machine code is becoming less distinct. For some time, the 
"machine code" that programmers have used has actually been interpreted by programs 
written for a lower level instruction set, called microcode. The task of writing such 
microprograms has been done by a few experts and typically stored in read-only 
memory. In recent years, however, manufacturers have provided computers that store 
their microprograms in read-write memories, allowing a language designer to tailor the 
perceived machine language to match the constructs of the source language. Using the 
formalism of information theory, we shall see how the statistical properties of 
programming language usage can be exploited in the design of an interpretive code. 

1.2. Entropy of Programs 

Entropy is a quantity associated with a source, a process that emits a sequence of 
symbols (outputs) according to some fixed probability laws. Typically, the probabilities 
associated with a given source output are influenced by the values of previous outputs. 
When dealing with experimental data, we cannot determine the exact probability laws, 
but must be content to build a model of the source and estimate the entropy of the 
model. We will carefully construct our model so that the entropy of the model is an 
overestimate of the true entropy; hence, we know that our lower bound for encoding is a 
conservative one. 
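
As a minimal sketch of the quantity being estimated, the following Python fragment computes the entropy of the simplest possible model, a memoryless source whose probabilities are the observed symbol frequencies. The function name and the toy stream of node kinds are illustrative assumptions, not data from the sample studied in this thesis.

```python
import math
from collections import Counter

def empirical_entropy(symbols):
    """Entropy, in bits per symbol, of the frequency distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical stream of tree-node kinds (not taken from the thesis).
sample = ["assign", "var", "num", "assign", "var", "var", "call", "var"]
print(empirical_entropy(sample))  # 1.75
```

A model that also conditions on earlier outputs can only lower this figure for the same distribution, which is why an estimate that ignores some dependencies errs on the high side.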

There has been much activity in estimating the entropy of natural language. In an 
early paper on the subject [Shannon 51], several models of English language sources were 
investigated. Some models had probability rules that depended on the value of the
previous few characters or words; such models are called Markov sources. One can 
apply similar models to programming languages. With programming languages, however, 
we are dealing with a much more controlled grammar, so we need not consider the 
programs as mere strings of source text. We can exploit the structure of the statements 
in a straightforward manner. 

One way that programming languages differ from natural languages is that 
sentences have an underlying nested structure that can be thought of as a tree. Modern 
programming language constructs such as "if . . . then . . . else" and "begin . . . end"
allow statements much more deeply nested than even German natural language 
sentences. The author once wrote a program profile facility which produced a neatly 
indented listing of a program together with statement counts. One of the early users of 
the system uncovered a "bug" when one of his program statements that was nested over 
30 levels deep caused the placement procedure to "indent" the program completely off 
the right side of the page! Such programs are fairly uncommon, but they do help to 
point out the treelike nature of computer programs. 

In this thesis, we will be concerned with finding representations for programs that 
allow the tree structure to be exploited when estimating the average information content 
of an atomic program symbol. The representation chosen is a tree as produced by the 
parser and clarified by means of the program semantics. What is meant by the entropy 
of such a model? Standard formulas for the entropy of Markov sources require knowing 
the probabilities of all m-tuples of source symbols and conditional probabilities based 
upon all (m-1)-tuples, numbers which are difficult to obtain for some m-tuples with any
confidence due to the finite nature of our sample set. We will see an alternative 
formula in which the amount of history used in the calculation is context dependent 
rather than uniform. This formula has better error bound properties for estimating 
entropy from experimental data, and is readily generalized to an entropy formula for 
trees, allowing us to define the concept of a Markov source of trees and estimate its 
entropy with a reasonably sized empirical sample. 
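
For reference, the standard uniform-order formula alluded to above can be written as follows. This is the textbook form for a stationary Markov source of order m-1; the symbols x_1, ..., x_m and the alphabet A are notation introduced here for illustration, and the notation of Chapter 3 may differ.

```latex
\[
  H \;=\; -\sum_{x_1,\ldots,x_m \in A}
      P(x_1,\ldots,x_m)\,
      \log_2 P(x_m \mid x_1,\ldots,x_{m-1})
\]
```

The sum ranges over |A|^m tuples, so the number of probabilities to be estimated grows exponentially with m while the sample stays fixed in size. That is the difficulty the context-dependent formula is meant to ease.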
1.3. Survey of Related Work

Empirical Study of Programs 

Historically, designers of compilers and languages have had little knowledge of how 
typical programmers were using a programming language. There has recently been a 
trend toward empirical study of programs. One of the better known works is by Knuth 
[Knuth71a], in which he reports on a collection of Fortran programs analyzed one
summer at Stanford. His bibliography lists some of the earlier works in the field. 

The results of Knuth's paper are divided into static and dynamic statistics. In this 
thesis, we shall be primarily concerned with static statistics, although Chapter 6 contains 
a few words about the dynamic case. Knuth's static counts were made by a program that 
read source programs and maintained a count of occurrences of the various reserved 
words and operators of the language. The most striking conclusion that can be drawn 
from these numbers is that actual programs are far simpler than had been previously 
believed. Compiler writers had prided themselves on generating efficient code for
complicated expressions, while in practice, expressions had an average length of only two 
operands. Knuth found that 68% of the assignments involved no operator at all. Figure 
1.1 shows the complexity of expressions in assignment statements where the operators + 
and - were given one point, * given five, and / given eight points: 

Complexity       0       1       2       3       4       5       6       7       8       9

Number      56,751  14,645   1,124     106     267   2,436   1,988     562   2,359     552
Percent       68.0    17.5     1.4     0.1     0.3     3.0     2.0     0.6     3.0     0.6

Figure 1.1. Complexity of expressions in Knuth's Fortran sample. 
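
The scoring rule behind Figure 1.1 is simple enough to state as code. The sketch below is an illustration only: it borrows Python's own expression parser as a stand-in for a Fortran front end, and the function name and test expressions are assumptions rather than anything taken from Knuth's study.

```python
import ast

# Weights quoted in the text: + and - cost one point, * five, / eight.
WEIGHTS = {ast.Add: 1, ast.Sub: 1, ast.Mult: 5, ast.Div: 8}

def complexity(expression):
    """Sum the operator weights over every binary operator in `expression`."""
    tree = ast.parse(expression, mode="eval")
    return sum(
        WEIGHTS.get(type(node.op), 0)   # operators without a stated weight score 0
        for node in ast.walk(tree)
        if isinstance(node, ast.BinOp)
    )

print(complexity("b"))          # 0: no operator at all
print(complexity("i + 1"))      # 1
print(complexity("a * b + c"))  # 6
```

Under this rule a single multiplication scores 5 and a single division 8, which is consistent with the local peaks at those two columns of the figure.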

Further analyses showed that although Fortran allows arbitrarily complex control 
flow due to the presence of goto statements, all of the tested programs had reasonably 
simple flow graphs. Since the analysis programs were working with the surface strings 
of the language and not with parsed programs, they could not tell procedure calls from 
array accesses, so the interesting statistics about procedure parameters were not available.
Alexander and Wortman made similar studies of XPL programs [Alexander75].
Rather than write a routine to analyze source programs, they modified the compiler to 
count various statement and expression types. The XPL compiler is a BNF 
production-oriented one, so this task was easy. Like Knuth, they also analyzed the 
programs both statically and dynamically. The following enumeration from 
[Alexander75] shows statically obtained information they found useful for determining 
resource consumption: 

• distribution of constants by type and value; 

• distribution of variables by type, value, and point of declaration (lexical nesting);

• distribution of statements by type; 

• complexity of expressions and use of operators in expressions; 

• distribution of machine instructions emitted by the compiler, also adjacent pairs
and triples of instructions; 

• distribution of registers, operand addresses, and constants occurring in emitted 
instructions. 

Another interesting item from this paper is a table showing the distribution of 
numeric constants in their sample programs. Figure 1.2 shows an excerpt from their 
table, which also indicates the number of bits necessary to represent numbers in the 
given range by the standard number representation techniques. Their compiler treated 
unary minus as an operator rather than as a part of a number, so all constants in their 
sample were positive. 



Range            Number                         Cumulative
(logarithmic)   of bits   Number  Percentage    Percentage

ZERO                  1     7762        15.6          15.6
[2**0,2**1)           1     8459        17.0          32.6
[2**1,2**2)           2     3952         7.9          40.5
[2**2,2**3)           3     2986         6.0          46.5
[2**3,2**4)           4     4747         9.5          56.0
[2**4,2**5)           5     4682         9.4          65.4
[2**5,2**6)           6     5908        11.9          77.3
[2**6,2**7)           7     4715         9.5          86.8
[2**7,2**8)           8     4037         8.1          94.9
[2**8,2**9)           9     1372         2.8          97.7

Figure 1.2. Distribution of numeric constants in Alexander's XPL sample.
Figure 1.2 provides the interesting result that more than half of the numeric
constants of their sample could be represented in 4 bits. In the sample of programs 
analyzed below in Chapter 4, the results were even more dramatic: when one considers 
all the numeric constants either stated directly or resulting from "constant folding", the 
numbers 0, 1, 2, and 3 account for 52% of the total static usage. 
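
The claim about 4-bit constants can be checked directly from the percentages of Figure 1.2. The following sketch assumes the plain binary representation implied by the figure's "number of bits" column; the helper names are hypothetical.

```python
def bits_needed(value):
    """Bits in the plain binary representation of a non-negative integer."""
    return max(1, value.bit_length())   # zero is still given one bit, as in the figure

# (number of bits, percentage) rows transcribed from Figure 1.2.
ROWS = [(1, 15.6), (1, 17.0), (2, 7.9), (3, 6.0), (4, 9.5),
        (5, 9.4), (6, 11.9), (7, 9.5), (8, 8.1), (9, 2.8)]

def share_within(max_bits):
    """Percentage of constants that fit in at most `max_bits` bits."""
    return sum(pct for bits, pct in ROWS if bits <= max_bits)

print(bits_needed(15), bits_needed(16))  # 4 5
print(round(share_within(4), 1))         # 56.0
```

The second line reproduces the cumulative figure of 56.0% for the range ending at 2**4.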

The author made an informal study of programs written in an early version of the 
Mesa language running on the PDP-10. This study was motivated by the trend toward 
personal computers that caused the effort on that language to be directed toward a small 
16 bit machine. In order to conserve space, the language was to be compiled into a 
compact code that was then to be interpreted. Several studies were made of existing 
Tenex-Mesa programs to give guidance to the designers of the interpreter. One of these 
studies, which was never formally written up, concerned the complexity of expressions 
and procedure calls. 

The Tenex-Mesa compiler parsed the programs into trees one statement at a time 
and then generated code from the tree in essentially a one pass style. Some of the 
operations normally associated with the code generation pass of classical compilers took 
place in the tree building process. For example, when building a tree for an iteration 
statement (for i. . .), the parser generated the standard trees of an assignment and an 
if-statement for the incrementation and testing of the control variable. It was easy to 
intercept the parse trees before the code generation and to "walk" through them, looking 
at various operators and their operands. 

Of particular interest were the assignment statements. Of the 4818 assignments in 
the Tenex-Mesa sample, 17% were of the form variable ← number, 10% were variable
← variable, and 18% were variable ← procedure call. Some 11% were increment or
decrement statements, i.e., variable ← same variable ± number. Figures 1.3 and 1.4 show
statistics taken on the arguments of procedure calls in the sample. 



args            0      1      2      3      4      5     >5

count         873   2462   1176    387     98     34      4
per cent       17     49     23      8      2      1      -

Figure 1.3. Number of arguments in Tenex-Mesa procedures.

# of                           field     array   procedure
args    variable   number  selection    access        call    other    total

1           1601      329        224        19         153      136     2462
2           1464      368        187        11         100      220     2350
3            601      319         90         7          36      108     1161
4            187       90         61         -           5       49      392
5             62       45         27         1           7       28      170
>5            10        3          5         -           3        3       24

total       3925     1154        594        38         304      544     6559
%             60       18          9         1           5        8

Figure 1.4. Description of arguments to Tenex-Mesa procedures.

Figures such as the above are helpful when deciding on conventions for subroutine 
linkage and immediate operands, but they need a firmer mathematical foundation on 
which the designer can base decisions about the relative merit of competing features for 
the interpreter. 



Probabilistic Grammars 

In an early paper on the entropy of languages [Grenander67], Grenander discussed 
the shortcomings of the simple Markov models of language generation such as those of 
the previously mentioned Shannon paper, in this case, the first order Markov model: 

... the probability that a certain word is generated given the string of words 
that precedes it, depends only upon the last word in the string. . . . Some defects 
of this model are obvious. Although it allows for stochastic dependence, it does 
so only in a very special way via interaction of neighboring words. It is easy to 
exhibit examples in which the dependence has longer span. . . . This defect of 
the model could be expressed by saying that the Markovian dependence is too 
linear; it attributes too much significance to the linear ordering of words making 
up the sentence. 

Grenander continues with a discussion of the exponentially growing number of 
probabilities needed for higher order Markov sources, and concludes that it is impractical
to get estimates of reasonable accuracy for source models of sufficiently high order to 
capture the span of dependencies in natural language text. He concludes: 
. . . The prospect of such models therefore appears gloomy if we require both a
high level of approximation and that we be able to estimate its structure from 
real data. What is wrong with this approach is clearly that we have generalized 
in a too mechanical manner. We must introduce the dependence in a way 
tailored to fit the phenomenon we study. 

He then continues by choosing to represent sentences by derivation trees, given a
context free grammar, and calculating the entropy in terms of probabilities associated with
the various productions used in the derivation of a sentence. Rather than discuss his work
in detail, we will consider two more recent papers in the field. 

In a recent paper [Soule74] on entropies of probabilistic grammars, Soule defines a
grammar $G$ as a triple $(V_N, V_T, R)$ consisting of a set of nonterminal symbols $V_N$,
terminal symbols $V_T$, and rules $R$. A rule is written

$$p : s \to \alpha,$$

where $0 < p \le 1$, $s$ is in $V_N$, and $\alpha$ is in $(V_N \cup V_T)^*$. Denote the set of rules
that rewrite $s$ by $R_s$; for each $s$, we must have

$$\sum_{r \in R_s} p_r = 1,$$

where $p_r$ is the probability associated with rule $r$.



Soule denotes the language generated by $G$, with starting symbol $s$, by $L(G, s)$. He
chooses to define the entropy in terms of a derivation, a (possibly countable) sequence
of rule-names that specify the generation of either a sentence in $L(G, s)$, or (in bad
cases) a countably long intermediate form. If more than one non-terminal is to be
rewritten at some step in the derivation, we will choose to do so in a consistent order,
such as always rewriting the leftmost non-terminal. The set of all derivations beginning
with a rule in $R_s$ is denoted by $\Omega(G, s)$. The probability of a derivation $d$, denoted
$P(d)$, is the product of the probabilities associated with all rules in the derivation. The
function

$$\mathrm{Gen}_s : \Omega(G, s) \to L(G, s)$$

simply maps a derivation into the word that results when $s$ is rewritten by each of
the rules that make up that derivation.
The probability $P(w)$ of a word $w$ in $L(G, s)$ is defined to be the sum of the
probabilities of all derivations $d$ such that $\mathrm{Gen}_s(d) = w$.

The derivational entropy of $G$ is a vector $H(G)$ of $|V_N|$ components such that for each
$s$ in $V_N$,

$$H(G, s) = -\sum_{d \in \Omega(G, s)} P(d) \log P(d).$$

Of more interest than the entropy of derivations is the entropy of words. Soule defines
the sentential entropy to be the following:

$$H_S(G, s) = -\sum_{w \in L(G, s)} P(w) \log P(w).$$

He then proves a theorem that $H_S(G, s) \le H(G, s)$, with equality if and only if $L(G, s)$
is unambiguous. Soule gives several results relating the mean word length and sentential
entropy to various information theoretical concepts such as information rate, and
channel capacity.
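
The relation between the two entropies can be seen on a very small example. The sketch below uses a hypothetical, deliberately ambiguous grammar invented for illustration (it is not from Soule's paper): the word "a" has two derivations, so the sentential entropy falls strictly below the derivational entropy. The enumeration only terminates for grammars without recursion.

```python
import math
from collections import defaultdict

# Hypothetical probabilistic grammar: nonterminal -> list of (probability, right-hand side).
# Uppercase letters are nonterminals, lowercase letters are terminals.
RULES = {
    "S": [(0.5, "a"), (0.25, "A"), (0.25, "b")],
    "A": [(1.0, "a")],
}

def derivations(form, prob=1.0):
    """Yield (word, probability) for every leftmost derivation starting from `form`."""
    for i, symbol in enumerate(form):
        if symbol in RULES:
            for p, rhs in RULES[symbol]:
                yield from derivations(form[:i] + rhs + form[i + 1:], prob * p)
            return
    yield form, prob  # no nonterminal left: `form` is a word of the language

def entropy(probabilities):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

derived = list(derivations("S"))
word_prob = defaultdict(float)
for word, p in derived:
    word_prob[word] += p          # P(w) sums over all derivations of w

h_derivational = entropy(p for _, p in derived)
h_sentential = entropy(word_prob.values())
print(round(h_derivational, 3), round(h_sentential, 3))  # 1.5 0.811
```

Removing the rule that rewrites S to A (and renormalizing) makes the grammar unambiguous, and the two values then coincide.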

Thompson and Booth [Thompson71a] had previously considered various schemes
for encoding the languages generated by probabilistic grammars. For a probabilistic
language L, they defined a code to be another probabilistic language C, with terminal
alphabet typically {0, 1}, such that there is a mapping of L onto C. In general, one code
word in C is the empty code word e, where $w \in L$ is mapped onto e if P(w) = 0 or 1, i.e.,
if w is impossible or certain. If $L_e \subseteq L$ is defined as

$$L_e = \{\, w \in L \mid w \text{ maps onto } e \,\},$$

then the mapping of $L - L_e$ onto $C - \{e\}$ is a one-to-one mapping.

Thompson and Booth investigated four classes of coding automata that use the 
probability information to define the mapping from L onto C in such a way as to be 
optimal under various constraints. They are character encoding, word encoding, 
grammar encoding, and parse encoding. The character encoding is the standard 
Huffman encoding [Huffman52] obtained by calculating the probabilities of the various 
characters. The word encoding requires a finite language, but accommodates general 
ones by approximating the language by a finite one and including an "escape code". The 
grammar encoding uses the previous k characters of a sentence to determine the 
encoding of the next character. They show that if the language has a property called
LL(k), then the language C is a context-free probabilistic language, and, since there are
means of determining the average word length for such languages, the codes can be 
compared. The parse encoding is similar to the entropy model of Soule; one encodes the 
sequence of productions from one possible derivation of the sentence to be encoded. 
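
Of the four schemes, the character encoding is the easiest to make concrete. The following is a minimal sketch of Huffman's construction, not a reproduction of Thompson and Booth's automata; the function name and the sample string are assumptions made for illustration.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a {character: bit string} Huffman code for the characters of `text`."""
    counts = Counter(text)
    if len(counts) == 1:                      # degenerate one-symbol alphabet
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {char: partial code}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)     # the two lightest subtrees
        w1, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]

code = huffman_code("aaaabbc")
encoded = "".join(code[ch] for ch in "aaaabbc")
print(len(encoded))  # 10 bits, against 14 for a fixed 2-bit code
```

The most frequent character receives a one-bit code word and the two rarer ones two bits each, so frequent symbols cost less, which is the property all four schemes exploit in different ways.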

The grammar based models of language encoding are similar to the Markov model 
we shall be concerned with in later chapters of this thesis, restricted to uniform first 
order approximations in which only the father of a node in the parse tree affects the 
conditional probabilities of this node. Furthermore, the trees we shall deal with contain 
considerable semantic information. 

Statistical Models 

One of the early mentions of program entropy was in a set of unpublished notes 
written by McKeeman and Horning sometime around 1968-69 [McKeeman68]. Their 
principal concern was determining the appropriateness of a particular machine 
organization for a given programming language by seeing which machine yielded the 
smallest program. They also had some thoughts on theoretical bounds of encoding using 
an entropy-like formula. They considered how to deal with the problem of variable 
names: 

. . . We clearly do not wish to distinguish, in machine code, programs that differ 
only by a systematic substitution of identifiers. We propose that we first 
tabulate all identifiers used in a program and then systematically replace them by 
a standard set (say X1, X2, . . ., X100, etc.). We can reduce the set even more by
taking into account block structure and the permissibility of duplicate use of 
names, or may have to do something more complex if the form of the identifier 
carries some semantic information (for instance the [Fortran] IJKLMN type 
convention). The essential point, reducing the variability of programs by using a 
consistent naming convention, is straightforward for any given conventional 
programming language. 
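
The systematic substitution they describe might be sketched as follows. This is an illustration under simplifying assumptions: identifiers are recognized by a regular expression, the reserved-word list is hypothetical, and block structure is ignored, so every occurrence of a name maps to the same standard identifier.

```python
import re

KEYWORDS = {"if", "then", "else", "begin", "end"}   # hypothetical reserved words

def canonicalize(source):
    """Replace each distinct identifier by X1, X2, ... in order of first appearance."""
    names = {}

    def rename(match):
        word = match.group(0)
        if word in KEYWORDS:
            return word
        # First sighting allocates the next standard name; later ones reuse it.
        return names.setdefault(word, f"X{len(names) + 1}")

    return re.sub(r"[A-Za-z_]\w*", rename, source)

print(canonicalize("if count > limit then count := count - step"))
print(canonicalize("if n > max then n := n - d"))
# both print: if X1 > X2 then X1 := X1 - X3
```

Two programs that differ only in their choice of names become identical after the substitution, which is the reduction in variability that the quotation asks for.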

Their means for estimating the entropy of a given program was essentially that 
used by Soule. Their methods did not take into account any influence of previous
productions on the probability of the current one, although they pointed out
that the higher order dependencies should be considered when defining entropy. At the
time, they were only considering conventional machine architectures where such
conditional probabilities could not be exploited.

In his thesis [Hehner74], Hehner applied the framework of information theory to 
the design of computer hardware, both instruction encoding and data encoding. In a 
more recent paper [Hehner77], he was concerned primarily with the encoding of 
computer instructions. He found that for his sample, as much as 75% of the space taken 
by contemporary machine-language representations could be saved. One question that 
he addressed is the reliability implications of removing redundancy from object 
programs: 

One may object to the goal of minimizing redundancy on the grounds that it is 
needed for reliability. Some forms of redundancy allow the detection of some 
errors. ... If the error results in a legal instruction, it may be detected indirectly 
but escape identification, or it may escape detection. . . . The use of accidental 
redundancy in machine-language instructions for error detection is at best a
haphazard approach, and at worst a poor excuse for badly-designed codes. The 
purpose of machine-language is to specify a sequence of actions as succinctly as 
possible. Error detection ability is important enough to deserve its own separate 
mechanisms, specially designed for that purpose, such as parity bits or tag bits. 

Hehner also discussed the entropy of programs; he represented the program as a 
sequence of tokens in the conventional way (rather than as tree structure), and he 
determined an entropy estimate in terms of bits per token. If there is a large sample of 
programs, one can hope that the frequencies of the various tokens are representative of 
programs in general. This is also probably true for pairs of tokens, and, with decreasing 
confidence, for higher order m-tuples. He defined the entropy in terms of conditional
probabilities much as we shall do in Chapter 3, but did not provide any numerical 
estimates from his sample for the entropy of the tokens. 

Hehner then applied the same methodology to the object programs generated by the 
XPL compiler. The stream of instructions can be thought to have conditional 
probabilities based on previous instructions just as source tokens do. Since his 
measurements were all static, it is necessary to establish a known context after every 
label. One very interesting portion of his paper is an analysis of a process he calls 
iterative pairing wherein commonly occurring adjacent pairs of instructions are replaced 
by a single one. He gave criteria for deciding how much the information content of the
program will decrease upon making the replacement (it may actually be increased by the 
replacement). He then discussed the encoding of instructions using their conditional 
probabilities based on m previous instructions. Figure 1.5 shows the numerical results 
from encoding his sample by various schemes. The "minimum redundancy" figure 
comes from applying the Huffman coding procedure to the "360-like" instructions. 

                                  bits per    percent   percent
     encoding                     operation   of (b)    of (a)

(a)  like IBM 360                   8
(b)  "minimum redundancy"           3.6                  45.3
(c)  iterative pairing              1.8*       49.9      22.6
(d)  conditional coding
       1 preceding                  2.1        57.2      25.9
       2 preceding                  1.7        47.0      21.3
       3 preceding                  1.6        43.8      19.8

* bits per original operation, 4.85 bits per compound operation.

Figure 1.5. Code compression in Hehner's XPL sample.

Finally, Hehner showed that in the limit, conditional encoding of instructions is 
always better than grouping together common operations for code compactness. The 
principal differences between the work of Hehner and the author's work are in the 
source of tokens to be encoded, object instructions vs. parse tokens, and in the structure 
imposed upon them, linear vs. tree structure. 
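
One step of a pairing process of this kind can be sketched as follows. The sketch does not reproduce Hehner's criteria: it simply measures the information content of an instruction stream under an order-0 model fitted to that stream, before and after the most common adjacent pair is replaced by a compound instruction, and it ignores the cost of describing the enlarged instruction set. The instruction names and helper functions are hypothetical.

```python
import math
from collections import Counter

def total_bits(seq):
    """Information content of `seq` under an order-0 model fitted to `seq` itself."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c * math.log2(c / n) for c in counts.values())

def pair_once(seq):
    """Replace non-overlapping occurrences of the most common adjacent pair."""
    pair, _ = Counter(zip(seq, seq[1:])).most_common(1)[0]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + "+" + seq[i + 1])   # compound instruction
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

program = ["load", "add", "store", "load", "add", "store", "load", "jump"]
paired = pair_once(program)
print(len(program), round(total_bits(program), 2))
print(len(paired), round(total_bits(paired), 2))
```

On this toy stream the pairing shortens the program and lowers its total information content; on other streams the enlarged alphabet can outweigh the saving, which is the case the parenthetical remark above warns about.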

Program Oriented Encodings 

The use of an interpreter of specialized codes has long been a practice by software 
designers to reduce the size of programs. Recently, there have been some papers on 
producing computers with architectures well suited to a particular language. Hehner 
discusses a number of the earlier works in his thesis [Hehner74], one of the better 
known examples being the Burroughs B5000 series of computers. Now that
microprogramming has made computer architecture such a malleable concept, we can 
expect more work in the field. 

Deutsch [Deutsch73] describes an architecture for a machine to execute Lisp
programs, called MicroLISP. He observes that a typical Lisp function references
rather few functions and variables, and hence can be encoded by instructions with short
address fields that refer to an external table, called the "local name table". In addition,
there are a small number of commonly referenced functions stored in a "global name
table". These techniques, plus others described in the paper, allow object programs of
one-third to one-fourth the size of the same programs compiled for the PDP-10 by 
more conventional means. 

Deutsch also presents an argument for program oriented architectures concerning 
the ease of writing debugging aids: 

MicroLISP has been presented as a machine language, but slight additions would
permit unambiguous decompilation into the original S-expression for editing. 
This approach is only feasible in general when the machine language closely 
resembles the source code: compilers for conventional machines must rearrange 
and suppress the original program structure extensively to achieve efficient 
execution. Interpretive systems, of course, generally do reconstruct the source 
text from an intermediate representation, often using their knowledge of the 
program structure to advantage (e.g. indenting to indicate depth of logical 
nesting). 

Wade and Stigall [Wade75] analyze the potential cost savings of several encoding 
features. They first analyze the grouping of common sequences of instructions into new 
instructions. This is essentially the "iterative pairing" process of Hehner, and the 
criteria for improvement are essentially equivalent. They also analyze the situation 
where the instructions can be broken into several classes, with the interpreter having 
"modes". They give formulas for the code length improvement as a function of the 
average number of instructions executed before a "mode switch" instruction is needed. 
Foster and Gonter [Foster71] describe a technique for conditionally encoding
computer instructions. They observe the common sequences of instructions in object 
programs and make the following observations: 

...Given that we have just executed instruction x, it is not the case that all 
possible instructions are equally likely to follow x. For example, "load 
accumulator" is very rarely (in most codes) followed by "enter accumulator." . . . 
Suppose instead of complete generality we decide that in the normal course of 
events we will allow N different instructions to follow a given instruction. One
of these must provide an escape mechanism to allow for the unusual program. If
N is much less than the total repertoire of the machine, we can achieve
considerable compression of the op-code field. 

To try out their ideas, they took a collection of programs written in assembly 
language for the CDC-3600. Figure 1.6 shows the percentage of op-code transitions 
captured by the mechanism as a function of the size of the conditional op code field. 

Figure 1.6. Conditional code efficiency from Foster and Gonter study. 

The total number of possible instructions is 142. Clearly conditional encoding of 
the instructions would reduce the total space required. They discuss the difficulties of 
programming such a machine in assembly language, primarily in debugging. The 
modern trend is away from assembly language programming, so this is less of a problem; 
a debugger that operates on a source language level can be made smart enough to 
translate the conditional codes into something understandable. 
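
The measurement behind Figure 1.6 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `transition_coverage` and the op-code mnemonics in `trace` are hypothetical (they are not CDC-3600 instructions), and one slot of the conditional field is assumed to be reserved for the escape code.

```python
from collections import Counter, defaultdict

def transition_coverage(opcodes, field_size):
    """Fraction of op-code transitions expressible with a conditional field
    naming `field_size` successors per instruction, one of which is the
    escape code (so field_size - 1 real successors per predecessor)."""
    followers = defaultdict(Counter)
    for prev, cur in zip(opcodes, opcodes[1:]):
        followers[prev][cur] += 1
    total = sum(sum(c.values()) for c in followers.values())
    captured = 0
    for counts in followers.values():
        # Keep the most frequent successors; everything else needs the escape.
        captured += sum(n for _, n in counts.most_common(field_size - 1))
    return captured / total if total else 0.0

# Hypothetical instruction trace, for illustration only.
trace = ["LDA", "ADD", "STA", "LDA", "ADD", "STA", "LDA", "SUB", "STA", "JMP"]
for size in (2, 3):
    print(size, transition_coverage(trace, size))
```

Plotting this fraction against the field size for a real collection of programs would give a curve of the kind the figure reports.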

Of course, computer programs do not consist entirely of instructions; they also 
contain data. For some languages, it is possible to encode these data effectively by 
techniques exploiting their statistical properties. For the language Lisp, the line 
separating instructions and data is a fuzzy one; when developing programs, one tends to 
run them interpretively. Clark and Green [Clark??] describe an empirical study of Lisp 
list structures. The computation of theirs most closely related to the content of 
this thesis is the entropy of car and cdr pointers (relativized to the address of the cell 
containing the pointer). The common Lisp systems for the PDP-10 use 18 bits for each 
of the pointers. Clark and Green found considerably fewer bits of information in the 
pointers of the programs that they studied. 

The statistics of that paper are all static, but in his thesis [Clark76], Clark also 
obtained dynamic statistics on distribution of values for car, cdr, and other language 
features. In addition, he gave algorithms for linearizing lists, i.e., rearranging them so 
that the cdr (car) points to the very next cell in memory whenever possible. Two thirds 
of the car's and one fourth of the cdr's do not point to lists at all, in which cases the 
algorithm does not apply. Nevertheless, one would predict that the entropy estimate of 
the affected pointer would go down after such an operation. In fact, the entropy 
estimate of both pointers usually goes down; Figure 1.7 shows the entropy of car and cdr 
in the original data, and after either car or cdr direction linearization. They chose to 
give separate values for each of five large sample programs. 

                original              car direction           cdr direction 
sample          data                  linearization           linearization 
number       car   cdr   sum       car    cdr    sum       car    cdr    sum 

                                   7.90   2.37  10.27      9.15   1.13  10.27 
                                   8.63   3.04  11.66      9.88   1.98  11.86 
                                   8.35   2.38  10.73      9.56   1.28  10.84 
                                   4.96   2.39   7.35      6.13   1.17   7.30 
                                   8.16   2.07  10.23      9.19    .99  10.17 

Figure 1.7. Pointer entropy from Clark and Green study. 

The reason that the entropy estimate for cdr dropped more dramatically upon 
linearization is that the cdr pointed to another list a much higher percentage of the 
time than did the car. Also, the car direction linearization tended to put the cdr cells 
nearby as well, so either form of linearization produced about the same numbers. Clark 
and Green point out that the actual entropy of the pointers should be dependent only on 
the semantics of the language and not on the particular representation chosen. Thus the 
smaller entropy numbers obtained by linearizing the lists more accurately reflect the 
entropy of the pointers. 
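
To illustrate why linearization lowers the estimate, the following sketch computes the entropy of cdr pointers relativized to the address of the containing cell. The two small memory layouts are invented for the example, and the function names are hypothetical; this is not the procedure Clark and Green used.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Empirical entropy, in bits, of a non-empty list of observed values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def relative_cdr_entropy(cells):
    """cells maps a cell address to the address held in its cdr field;
    the entropy is taken over the relative offsets (cdr - address)."""
    return entropy_bits([cdr - addr for addr, cdr in cells.items()])

# The same five-cell list, scattered through memory and then linearized
# so that each cdr points to the very next cell whenever possible.
scattered = {0: 5, 5: 2, 2: 7, 7: 3, 3: 9}
linearized = {0: 1, 1: 2, 2: 3, 3: 4, 4: 9}

print(relative_cdr_entropy(scattered))
print(relative_cdr_entropy(linearized))
```

After linearization nearly every offset is +1, so the offset distribution is sharply peaked and its entropy is lower.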

1.4. Summary of Thesis Results 

In this thesis, a large sample of programs, amounting to approximately a million 
characters of source code, is analyzed, using new techniques for analysis. In the first 
place, the representation chosen for study is a tree, for reasons described in Chapter 2; 
an entropy number is calculated in terms of bits per node, where the nodes are those 
making up the trees. Since there is a reasonable correspondence between input tokens 
and tree nodes, these numbers could also be stated in terms of more familiar dimensions 
of programs. By making several extensions and generalizations to the entropy formulas 
for sources of strings, the entropy of trees is defined in terms of the probability 
distribution of nodenames occurring at any given point in the tree. In most cases, these 
probabilities depend upon the values of various neighbors of the given node. 

Several mathematical models are presented to capture the dependencies of nodes 
upon their neighbors. The first class consists of the uniform Markov models, in which 
there is a fixed collection of neighbors determining the conditional probabilities of a 
node. Figure 4.2 contains detailed results of the entropy estimates using these models; 
Figure 1.8 is a summary of the entropy values. 

                                        3.16 
grandfather, father 
grandfather, father, elder brother      1.66 



Figure 1.8. Summary of uniform Markov entropy estimates. 
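
A minimal sketch of such a uniform model, assuming trees are represented as (name, list of sons) pairs, is given below. The names `walk` and `conditional_entropy` and the toy tree are hypothetical, chosen only for illustration; the thesis's own analysis programs are not reproduced here.

```python
import math
from collections import Counter

NONE = "<none>"  # placeholder for a missing neighbor

def walk(tree, father=NONE, grandfather=NONE, elder=NONE):
    """Yield (grandfather, father, elder brother, name) for every node."""
    name, sons = tree
    yield (grandfather, father, elder, name)
    previous = NONE
    for son in sons:
        yield from walk(son, name, father, previous)
        previous = son[0]

def conditional_entropy(tree, use):
    """Bits per node when each node is conditioned on the neighbors selected
    by `use` (indices into the (grandfather, father, elder brother) triple)."""
    joint, context = Counter(), Counter()
    for row in walk(tree):
        key = tuple(row[i] for i in use)
        joint[(key, row[3])] += 1
        context[key] += 1
    n = sum(context.values())
    return -sum(c / n * math.log2(c / context[key])
                for (key, _), c in joint.items())

toy = ("list", [
    ("assign", [("var", []), ("num", [])]),
    ("assign", [("var", []), ("var", [])]),
    ("call",   [("var", []), ("num", [])]),
])

h0 = conditional_entropy(toy, ())          # no neighbors
h1 = conditional_entropy(toy, (1,))        # father
h3 = conditional_entropy(toy, (0, 1, 2))   # grandfather, father, elder brother
print(h0, h1, h3)
```

On any sample the estimate cannot rise as neighbors are added, which is exactly why small samples make the higher order figures look better than they are.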

The finite nature of the sample set causes problems when using these uniform 
formulas. For a model using m neighbors, the formulas require probabilities for all 
(m+1)-tuples, and conditional probabilities based on all m-tuples. For those m-tuples 
with only a few occurrences in the sample, there is a high likelihood of error in the 
probability estimate. 

Dissatisfaction with the uniform models motivated the derivation of a new 
non-uniform entropy formula wherein nodes are conditioned on various collections of 
neighbors depending upon the context. The collection of neighbors that influence the 
probability distribution of a place in the tree is called a pattern. An initial set of 
patterns is generated by a modification of the uniform Markov entropy models. A 
procedure for obtaining a larger set of patterns, called pattern refinement, is given, with 
a proof that the entropy of the enlarged model is a better estimate of the true entropy, 
except in a specialized case, where it is not worse. Figure 1.9 shows the entropy estimate 
both before and after enlarging the pattern set. One cannot compare the size of the 
patterns exactly with the number of conditioning nodes in the uniform models since the 
patterns specify not only who the father is, but which son position a particular node 
occupies. Nevertheless, the number of nodenames specified in the pattern is given in the 
figure. The last two columns relate the estimate to more familiar quantities. 



             number of patterns       Source        Equivalent 
             nodenames                entropy    bits per   bits per 
             specified:  1   2   3    estimate   token      character 

initial set            154               2.1       3.1        .63 
final set              152  41  75       1.6       2.4        .48 



Figure 1.9. Non-uniform Markov entropy estimates. 



The reduction in the estimate is not as dramatic as that of the uniform Markov 
estimates, partly because the initial set is already well chosen; furthermore, a 
conservative approach was taken of only increasing the order in those cases where there 
was a reasonable sample for estimating the probabilities. 
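
The effect of a single refinement step can be sketched as follows. The representation is an assumption made for the example: each sample is a (context, nodename) pair whose context is (father, son position), a pattern set is a function assigning each context to a pattern label, and the names `model_entropy` and `refine` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def model_entropy(samples, classify):
    """Bits per node when every node is coded under the conditional
    distribution of the pattern that classify() assigns to its context."""
    groups = defaultdict(Counter)
    for context, name in samples:
        groups[classify(context)][name] += 1
    bits = 0.0
    for counts in groups.values():
        size = sum(counts.values())
        bits -= sum(c * math.log2(c / size) for c in counts.values())
    return bits / len(samples)

def refine(classify, old, extra, label):
    """Split the nodes matching pattern `old`: those whose context satisfies
    `extra` move to the larger pattern `label`; the rest remain under `old`."""
    def refined(context):
        pattern = classify(context)
        return label if pattern == old and extra(context) else pattern
    return refined

def by_father(context):
    return context[0]

def first_son(context):
    return context[1] == 1

samples = ([(("assign", 1), "var")] * 4 + [(("assign", 2), "num")] * 2 +
           [(("assign", 2), "var")] * 2 + [(("call", 1), "var")] * 2)

before = model_entropy(samples, by_father)
after = model_entropy(samples, refine(by_father, "assign", first_son, "assign/son 1"))
print(before, after)
```

Here the first son of assign is always a variable, so separating that position removes all uncertainty about it and the estimate falls from about 0.65 to 0.4 bits per node.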

Analytic solutions for the error bounds are very difficult to derive when dealing 
with Markov processes. In order to obtain some confidence in the probabilities that 
determine the entropy of the model, the sample was therefore divided into several pieces, 
with a portion of the sample, called a training sample, used to postulate an encoding for 
the trees. The remaining test sample can be encoded according to these codes and an 
average code length computed. If the statistics of the sample are uniform over the 
pieces, this average length will be close to the entropy. By applying this process 
repeatedly, with each piece taking a turn as the test sample, we obtain a more realistic 
estimate of the entropy for the sample: 1.8 bits per node, ±13%. 
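
The rotation of training and test samples might be sketched as below. This is a simplified illustration, not the thesis's procedure: it treats the source as zero-memory, and it uses add-one smoothing so that symbols absent from the training sample still receive a code, which is an assumption of the example. The function names are hypothetical.

```python
import math
from collections import Counter

def average_code_length(train, test, alphabet):
    """Average bits per symbol of `test` under ideal code lengths
    -log2 p, with p estimated from `train` (add-one smoothing assumed)."""
    counts = Counter(train)
    total = len(train) + len(alphabet)
    return sum(-math.log2((counts[s] + 1) / total) for s in test) / len(test)

def rotate_test_piece(pieces):
    """Let each piece take a turn as the test sample; train on the rest."""
    alphabet = {s for piece in pieces for s in piece}
    lengths = []
    for k, test in enumerate(pieces):
        train = [s for j, piece in enumerate(pieces) if j != k for s in piece]
        lengths.append(average_code_length(train, test, alphabet))
    return lengths

pieces = [["var", "var", "num", "assign"]] * 3
print(rotate_test_piece(pieces))
```

The spread of the resulting lengths across pieces gives a rough experimental indication of how far the estimate may be in error.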

1.5. Guide to the Reader 

Chapter 2 discusses ways of representing a program as a tree structure. A sample 
program is given which is used throughout the thesis whenever examples of 
representation or encoding are needed. Details of the representation ultimately chosen 
are presented, together with diagrams of trees drawn from the representation of the 
sample program. 

Chapter 3 contains most of the formal mathematics of the thesis. It contains 
definitions of all the information theory terms used, and a discussion of trees as outputs 
from a Markov source. The non-uniform formula for entropy of a Markov source is 
derived, and a generalization of the formula is made for trees, taking the structure of the 
trees into account when defining the concept of "previous". Finally, there is a proof 
that the pattern refinement procedure, described further in Chapter 4, always yields an 
improvement in the estimate (although increased experimental error may outweigh the 
improvement). 

Chapter 4 contains the experimental results. The sample set, comprising 
approximately a million characters of source programs, is described. Results are given 
for the entropy of Markov source models of various orders, using the uniform formula 
for entropy. The patterns used in the more general entropy formula are described, and 
sample results are shown from the entropy estimate determined from the original set of 
patterns. The new methodology called pattern refinement, which obtains a lower 
estimate by enlarging the set of patterns, is discussed, together with sample output from 
the process applied to the sample set of programs. Finally, it is shown how the tools 
developed for the pattern refinement procedures can be used to evaluate the merit of 
various program transformations interactively. 

Chapter 5 discusses the potential error in the entropy estimate. An experimental 
test of sample variability is described, together with results of applying the test to actual 
data. Another formula for the entropy estimate from a finite sample is presented which 
is faster and easier to compute than those of Chapter 3. 

Chapter 6 discusses applications of the methodology of the thesis, possible 
extensions, and suggestions for further research. 

Appendix A contains an alphabetical listing of the nodenames occurring in the 
program tree representations defined in Chapter 2. 

Appendix B contains a few comments about how some of the non-obvious 
algorithms of the thesis were implemented, together with a few general pointers for 
anyone who wishes to apply similar analyses to other programming languages. It also 
contains a description of the algorithm used to format the tree diagrams that appear in 
various figures of the thesis. 

Appendix C contains detailed statistics for both the initial pattern set and the final 
pattern set. They are shown sorted by pattern number, by entropy contribution, and by 
pattern description (alphabetically). These statistics reveal a great deal about the nature 
of "typical" programs, since they show the conditional probabilities of a wide variety of 
language constructs. 

2. Deciding on a Data Representation 

The programs analyzed in this thesis were written in Mesa [Geschke77], a 
language under development at Xerox Palo Alto Research Center, being implemented on 
a 16 bit minicomputer. For the purposes of this thesis, one can consider Mesa to be a 
dialect of Pascal [Wirth71]. 

Our goal is to estimate a bound for the space required to hold an object program, 
which we will do by studying a collection of existing programs. First, we must decide 
what representation of the programs should be adopted for study. Any representation 
that retains enough information to generate the object program is a legitimate candidate, 
but some representations allow a lower estimate, or make more intuitive the 
interpretation of partial results. Therefore, let us consider several possibilities. 

2.1. Representations not Employed 

The work of Thompson and Booth involving probabilistic languages used the 
grammar of the language to draw conclusions about encoding efficiency and hence 
entropy. This approach has some attractiveness because of the large body of theory 
involving grammars. However, the grammar used by the parser in the existing 
compiler has many nonterminal symbols that are present only to allow its specific 
parsing method to work, and do not really bear information. This grammar is not the 
proper one for incorporation into such a theoretical framework. Moreover, there are 
program dependencies which are semantically based, which would not be considered if 
we dealt only with syntax. 

Hehner used a related representation by investigating the sequence of tokens that 
make up a program. This is an improvement over surface strings, but does not capture 
nested dependencies where there is no adjacency of source text. He also studied the 
program object code, a representation in which too much encoding has already taken 
place to judge the true entropy of the language. 

2.2. Parsed Structure as Representation 

In any language such as Mesa (or Algol) that allows nested statements, a program 
can be viewed naturally as a tree. Such a representation allows the control structure and 
inter-statement dependencies to be easily manipulated by analysis programs. A tree 
representation has the advantage of being less tied to the surface structure of the 
particular programming language than most other choices. Parsed Algol programs and 
Lisp programs look quite similar. 

Fortunately, the Mesa compiler is a ready source of trees for programs. The 
compiler is written in the classical manner, with multiple passes operating on a 
program. It represents the intermediate stages as a tree. The initial tree is constructed 
by an LALR parser, and succeeding passes modify the tree in light of increasing 
knowledge, or extract information into symbol tables, literal tables, etc. The final passes 
generate code from the entire collection of information. In the discussion below, 
references to "the compiler" are to this multipass system. Other compilers would not 
necessarily perform the same actions at the various passes. 

Examples of language and encoding features discussed in the remainder of this 
thesis will be drawn from the following procedure, paraphrased from a stream 
input-output program. It is not really necessary to understand the operation of this 
program; we shall be concerned only with superficial aspects of its syntax. 

    WriteString: PROCEDURE [s: STRING] = 
      BEGIN 
      c: CHARACTER; 
      i: INTEGER; 
      FOR i IN [0..s.length) DO 
        c ← s[i]; 
        WriteCharacter[c]; 
        ENDLOOP; 
      IF s.length ^ 
        THEN StartofLine* ← (c = CR); 
      END; 

Figure 2.1. A sample Mesa program. 

To aid readers in understanding the syntax, a few explanatory comments are in 
order. 

• id : type is a declaration of id. The type of WriteString is a procedure that 
takes one parameter of type STRING. The "=" in the declaration equates 
WriteString to the following procedure body. 

• id . id' is a field selection operation. If id is a POINTER, the expression refers to 
the id' field of the RECORD to which id points. If id is of type RECORD, the 
expression refers to the id' field of id. The predefined type STRING is a pointer 
to a record containing a length field and the actual string text. The construct 
s.length is the length of the string s. 

• [exp..exp') is a half-open interval from exp up to but not including exp'. The 
FOR statement is executed for i taking on each value in the range. 

• Procedure arguments, like array indices and character selections, are enclosed in 
square brackets []. 

WriteString is a simple procedure that writes a string by writing each character 
in turn. It finally checks to see whether the last character written is a carriage return, 
and if so, sets a boolean flag saying that the output device is currently at the beginning 
of a new line. Figure 2.2 shows the statement list of the above procedure body as parsed 
by the first pass of the compiler. 

[Tree diagram: a list node whose two sons are dostmt and ifstmt; the individual 
subtrees are described in the following paragraph.] 


Figure 2.2. Portion of a parse tree generated by Pass 1 of the Compiler. 

In the remainder of the thesis, nodenames taken from trees will be underlined. For 
those readers not familiar with parse trees, Figure 2.2 may be somewhat difficult to read 
at first, but the meaning is actually quite straightforward. The procedure body is a list 
of statements, in this case a dostmt and an ifstmt. The subtree rooted at dostmt is 
generated for the source construct "FOR i ...". Notice that some of its sons are empty, 
denoted "<empty>". This is because the language allows several forms of iteration 
statement, with, for example, while clauses at the beginning or end, or with nonstandard 
exits. The parser generates the same general tree for all such statements, merely leaving 
the subtree "empty" for those features not present in the given source statement. The 
ifstmt tree is easier to understand. It has three sons: the test, the "then" part, and the 
"else" part. In the example program, there is no "ELSE" part. Appendix A gives a 
complete list of nodenames, together with the number of sons and the source constructs 
which give rise to each node. 

Such a tree representation of the program clearly does not discard any important 
information. In fact, a quite successful program has been written that takes these 
trees and unparses them to produce neatly indented listings, using the program structure to 
control line breaks, etc. For analysis, however, it is best to wait until later passes of the 
compiler have had a chance to embellish the tree. 

Knuth's study of Fortran programs used the surface strings as a representation 
[Knuth71]. We could produce similar results from Pass 1 trees, and would, in fact, have 
an easier task than he did, since statement typing and decomposition have already been 
done by the parser. The study of Tenex-Mesa programs described in Chapter 1 was 
done using Pass 1 trees. 

Experience with Pass 1 trees suggests several problems. Some syntactically similar 
program constructs must be differentiated by means of the semantics. Thus the string 
character selection and procedure call in the dostmt of Figure 2.2 are both parsed as 
apply by Pass 1. Later passes will convert the subtrees to seqindex and call, 
respectively. 

The names of variables allow us to tell very little about variable usage. In a 
strongly typed language, two identifiers of the same name appearing in the same 
expression might well have entirely different semantics. The earliest time to obtain a 
tree in which variable names have been disambiguated is at the end of Pass 3, when each 
name is replaced by an index into the compiler's symbol table. This seems like a good 
source of trees for analysis. Indeed, the initial programs developed for this thesis 
manipulated Pass 3 trees. 

Recall that we wish to encode programs using a minimum of space. As we shall see 
in Chapter 3, increasing redundancy can help to make more compact encodings. 
Programmers know that each program or procedure contains a few variables that are 
used frequently and some others that are used only a few times. Since the actual names 
of the variables do not affect the execution of the program, we can increase redundancy 
for analysis by renaming the variables of each procedure to be the same, ordering by 
frequency of reference. We are concerned with minimizing object code size, so static 
frequency is a sufficient measure. 
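
The renaming can be sketched as follows, assuming for illustration that a procedure is available as a flat list of tokens and that the set of its variable names is known. The function `rename_by_frequency` and the sample tokens are hypothetical.

```python
from collections import Counter

def rename_by_frequency(tokens, variables):
    """Replace each variable of one procedure by its rank in static
    frequency of reference: "v0" is the most frequently referenced."""
    freq = Counter(t for t in tokens if t in variables)
    rank = {name: f"v{i}" for i, (name, _) in enumerate(freq.most_common())}
    return [rank.get(t, t) for t in tokens]

# Two procedures with different variable names but the same usage pattern.
a = rename_by_frequency(["total", "<-", "total", "+", "x"], {"total", "x"})
b = rename_by_frequency(["sum", "<-", "sum", "+", "item"], {"sum", "item"})
print(a)
print(b)
```

After renaming, the two token sequences coincide, so their statistics reinforce one another instead of being spread over unrelated names.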

One cannot tell the type of a variable from its name. This information is available 
elsewhere in the parse tree in the declarations, but such remote references are not very 
convenient for analysis. Similarly, some "variable" names actually refer to named 
constants, such as CR in the example, which is declared earlier to be equal to the ASCII 
code for carriage return. For analysis, we wish to replace such references with their 
values. 

2.3. Tree Representation Employed in Analysis 

It is not until Pass 4 of the Mesa compiler that named constants are replaced by 
their value. As an added bonus, "constant folding" has taken place, i.e., arithmetic on 
constants has been performed at compile time. However, by this time, the declarations 
have been completely processed and removed from the trees. This poses a problem: if 
we are to benefit from the processing done by Pass 4, we need to encode some 
information from the symbol table. Otherwise, we would be lacking information about 
lengths of variables, etc. 

Putting the Variable Type Information Back into the Trees 

The symbol table entries for variables contain all the relevant information about 
type. One way to retain the information of the program would be to encode the parse 
tree and symbol table separately. This would be similar to the approach used with Pass 
3 trees, since the declaration trees were easily separable. This would retain a 
shortcoming that was observed in the statistics obtained from the Pass 3 trees: we know 
what operations are executed on variables, but we don't know the types of variables 
involved. That is an oversimplification; the type information is in the tree, but is not 
located conveniently near the actual variable occurrences. We solve this problem by 
placing information from the symbol table back into the tree. Each instance of a 
variable in a tree is replaced by the collection of information necessary to generate a 
proper object program from the tree. This puts more information into the tree than is 
absolutely necessary, but it simplifies investigations of operations needed to execute 
actual programs. 

Variables can be divided into four classes, depending upon where they are 
declared. For reasons related to the runtime structure, we will call the class name a 
frame. 

1. global -- a variable declared outside of procedure bodies. 

2. local -- a variable local to the given procedure. 

3. field -- a named field of a record. 

4. entry -- a procedure entry point. 

In our discussion above, we saw that renaming the variables within each procedure 
increases redundancy. It is sufficient to assign sequence numbers to the variables, 
ordered by frequency of use. Fortunately, the compiler does just that in assigning 
addresses, so that we may simply use the address of each variable within its procedure 
frame as its sequence number. 

Some variables, primarily field variables, begin at arbitrary places within a word. 
Thus, a variable must have a bit offset. 

Not all variables occupy a single full word of memory. For example, the type 
BOOLEAN is defined as a single bit. A length is therefore needed for a variable. The 
compiler unpacks non-field variables into whole words, but for analysis purposes, the 
length used is the minimum number of bits necessary to hold a value of the given type. 
In a language supporting floating point operations, additional information would have to 
be included to distinguish what operations were used with the variable. 

To summarize, each instance of a variable in the Pass 4 parse tree can be replaced 
by a tree with root var and four sons, corresponding to frame, address, bit offset, and 
length. For example, the variable c, an 8-bit character variable in the local frame, was 
assigned by the compiler to word 2 in the frame. References to c are replaced in the tree 
by 

               var 
            / |   | \ 
       local  2   0  8. 

This makes the trees rather cumbersome to read, but it puts the pertinent variable 
information in a form easily manipulated by tree matching procedures. 
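
A minimal sketch of this substitution, assuming trees are (name, list of sons) pairs and the symbol table is a dictionary, might look like the following. The names `Symbol` and `expand_variables` are hypothetical, and the bit offset of 0 for c is an assumption (a non-field variable starting on a word boundary).

```python
from typing import NamedTuple

class Symbol(NamedTuple):
    frame: str       # "global", "local", "field", or "entry"
    address: int     # word address within the frame
    bit_offset: int  # starting bit within the word
    length: int      # minimum number of bits for the type

def leaf(value):
    return (str(value), [])

def expand_variables(tree, symbols):
    """Replace every leaf naming a known variable by a var subtree
    with four sons: frame, address, bit offset, and length."""
    name, sons = tree
    if not sons and name in symbols:
        s = symbols[name]
        return ("var", [leaf(s.frame), leaf(s.address),
                        leaf(s.bit_offset), leaf(s.length)])
    return (name, [expand_variables(son, symbols) for son in sons])

symbols = {"c": Symbol("local", 2, 0, 8)}   # bit offset 0 is assumed
tree = ("relE", [("c", []), ("CR", [])])
print(expand_variables(tree, symbols))
```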

Marking Literal Constants in the Trees 

The nodes that provide type information for variables introduce a great quantity of 
numbers into the trees. We are interested in knowing what numbers are used by 
programmers as literal constants; therefore, two nonterminals were invented in the 
analysis programs for specifying numerical and string constants. All numerical 
constants are placed below num nodes, and all string constants are placed below str 
nodes. This mechanism provides a positive test (under num), instead of just a negative 
test (not under var) for determining the presence of a literal. It also helps to avoid 
fragmentation of common patterns, as we will see in our later discussion of patterns. 
Assuming that i is allocated the first location in the frame, the statement i ← 1 would 
compile to the tree 

              assign 
             /      \ 
          var        num 
        / |  | \      | 
    local 0  0  16    1. 

Dealing with Arbitrarily Long Lists 

Several schemes for dealing with the variable length list nodes were considered. 
Although the programs studied are written for a 16 bit computer, the analysis routines 
are based on a compiler that runs on a 36 bit machine. In this compiler, each 
non-terminal tree node has a field in the node header that gives the son count. This is 
redundant in all cases except list, since all other nodes have a degree that is known a 
priori. Hence, only list nodes need an explicit means for determining son count. In 
the Pass 3 trees, the convention was an invented node listterm as the final son of each 
list node. This is equivalent to parentheses with list as the opening bracket and 
listterm as the closing one. One of the problems with that scheme was the difficulty 
of relating the length of lists to other context information. Several other, forgettable 
schemes were considered when the decision was made to use Pass 4 trees, but the one 
chosen allows the lengths to be easily related to context in the same manner as are other 
tree nodes: the first son of list is a count of the remaining sons. In the experimental 
data, 85% of the list nodes have fewer than 5 sons, so considerably fewer than 16 bits 
are required on the average for the son count. Results from the entropy studies were 
used by the compiler designers to decide on an encoding of list in versions of the 
compiler that run on the 16 bit machine. 
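
The two conventions can be contrasted in a short sketch, again assuming trees are (name, list of sons) pairs. The function names are hypothetical.

```python
def with_terminator(tree):
    """Pass 3 convention: an invented listterm node closes every list."""
    name, sons = tree
    sons = [with_terminator(son) for son in sons]
    if name == "list":
        sons.append(("listterm", []))
    return (name, sons)

def with_count(tree):
    """Pass 4 convention: the first son of list counts the remaining sons."""
    name, sons = tree
    sons = [with_count(son) for son in sons]
    if name == "list":
        sons.insert(0, (str(len(sons)), []))
    return (name, sons)

body = ("dostmt", [("list", [("assign", []), ("call", [])])])
print(with_terminator(body))
print(with_count(body))
```

With the count in the first son position, the length of a list is an ordinary node whose value can be conditioned on its father like any other.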

Example of a Transformed Tree 

Figure 2.3 is the transformed Pass 4 tree of the dostmt of Figure 2.2. Note that 
the compiler has used the semantics of s and WriteCharacter to convert the apply 
nodes of Figure 2.2 into the proper seqindex and call nodes. 

[Tree diagram: dostmt with sons upthru, <empty>, list, <empty>, <empty>. The list 
node has the count 2 as its first son, followed by assign and call; each variable 
reference appears as a var subtree.] 

Figure 2.3. Portion of a converted Pass 4 tree. 

In these trees, there is enough information to generate the same object code as the 
compiler does from the original program. One could not, however, recover the exact 
surface strings of the original, since the variable names are not in the tree. Furthermore, 
the user may define types that are used for compile time type consistency checking, but 
are the same at machine level. The information is moved into the trees after the 
compiler has done such checking. 

2.4. Summary of Program Representation 

The empirical study of programs described in the remainder of this thesis 
represents a program by the parse tree obtained from the compiler after Pass 4, during 
which declarations are removed and expressions involving only constant values are 
replaced with their computed value. The analysis routines further replace variables by 
subtrees giving their address and length, and mark literals by inserting nonterminal 
nodes num and str above them. Variable length list nodes are encoded by placing in 
their first son position a count of the remaining sons. 






3. Finding an Entropy Model 

3.1. Some Definitions from Information Theory 

In order to discuss the methodology used for program analysis, it is necessary to 
present a few definitions and results from information theory. Let us consider a few 
definitions [Abramson63]. 

Definition: Let E be some event which occurs with probability P(E). If we are 
told that the event E has occurred, then we say we have received 

$$I(E) = \log \frac{1}{P(E)} = -\log P(E)$$

units of information. 

The unit of measure depends upon the base of the logarithm. Typically, $\log_2$ is 
used and the unit of measure is the bit. 

Definition: Let $S = \{s_1, s_2, \ldots, s_q\}$ be a fixed finite source alphabet. A 
discrete information source is a process that emits a sequence of source 
symbols according to some fixed probability law. 

The simplest type of source is one in which successive symbols are statistically 
independent. Such a source is called a zero-memory source and is characterized by the 
alphabet S, and the probabilities 

$$P(s_1), P(s_2), \ldots, P(s_q)$$

with which the symbols occur. We will often refer to a source by the name of its source 
alphabet when this can be done without danger of confusion. 

Definition: Let S be a zero-memory source with a given probability distribution 
of the symbols. The entropy of S, denoted H(S), is defined to be the average 
amount of information per source symbol: 

$$H(S) = \sum_{i=1}^{q} P(s_i) \log \frac{1}{P(s_i)} = -\sum_{i=1}^{q} P(s_i) \log P(s_i)$$

Entropy can be thought of as a measure of uncertainty. If a source usually emits 
the same symbol, there is little uncertainty, and also by the above definitions, the source 
has a low entropy. While the definition in terms of logarithms may seem somewhat 
arbitrary, it can be shown (see [Ash65] Section 1.2) that any uncertainty measure 
satisfying a particular set of reasonable axioms is equal to H within a constant 
multiplicative factor. We will later consider the definition of entropy for discrete 
information sources with more sophisticated probability laws. 
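
As a small numerical illustration of these definitions, the following sketch computes the information of an event and the entropy of a zero-memory source in bits. The function names and the example distributions are hypothetical.

```python
import math

def information(p):
    """Information, in bits, received on learning that an event of
    probability p has occurred."""
    return -math.log2(p)

def entropy(probabilities):
    """Entropy, in bits per symbol, of a zero-memory source; symbols
    of probability zero contribute nothing."""
    return sum(p * information(p) for p in probabilities if p > 0)

print(information(0.5))
print(entropy([0.5, 0.25, 0.25]))
print(entropy([0.97, 0.01, 0.01, 0.01]))
```

The last source usually emits the same symbol, and its entropy is correspondingly small.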



message -> source encoder -> channel encoder -> noisy channel -> channel decoder -> source decoder -> message' 
                             |<-------------- noiseless channel -------------->| 

Figure 3.1. A model of information transmission. 

The principal application of source information and entropy is to the problem of 
source encoding. We can envision a process where source messages are converted into 
some sort of code, transmitted across a channel, and decoded on the other side. There 
are generally two dissimilar kinds of encoding taking place: source encoding and channel 
encoding. In the former, the redundancy of the source output is being exploited to allow 
shorter average code lengths. In the latter, redundancy is often added so that errors 
induced by noise on the channel may be detected or corrected. As a simple example, 
parity is added to teletype characters. Often the channel encoding and decoding is 
"factored out" to yield an abstraction called a noiseless channel. Figure 3.1 is a block 
diagram of the transmission process. The criterion for source encoders and decoders is 
that no relevant information is lost between message and message'. Our goal is to 
investigate encoders for programming languages that minimize the average object code 
length. Therefore, the relevant information is that which is necessary to generate correct 
object programs. 

A code in which each codeword can be deciphered as soon as all bits of it are 
received is called an instantaneous code. This is a property that we desire for object 
codes. A fundamental theorem of information theory [Shannon48] states that no 
instantaneous encoding of a source may be found which requires on the average fewer 
bits per source symbol than the entropy of the source. Therefore, we want the entropy 
to be as small as possible. There are schemes that allow encodings which are 
quite close to the entropy in average code length [Huffman52] [Pasco76]. 
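
As an illustration of the bound, the following sketch builds a binary Huffman code for a small zero-memory source and compares its average codeword length with the entropy. The probabilities and the function names are made up for this example; they are not taken from the Mesa sample.

```python
import heapq
from math import log2

def entropy(probs):
    """Zero-memory entropy in bits: -sum p * log2(p)."""
    return -sum(p * log2(p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Return the codeword length a binary Huffman code gives each symbol."""
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # each symbol under the merged node gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical source
lengths = huffman_lengths(probs)
average = sum(p * n for p, n in zip(probs, lengths))
print(lengths, average, entropy(probs))
```

For probabilities that are powers of 1/2, as here, the average length equals the entropy; for other distributions the Huffman average lies between the entropy and the entropy plus one bit.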

3.2. Markov Sources 

For our purposes, it is too restrictive to consider only zero-memory sources. In 
many cases, the probability distribution of source outputs depends upon the values of 
recent previous outputs. Such sources are called Markov sources, and they require the 
concept of conditional probability for their analysis. 

Let $S$ be a source. We will need some notation to describe the sequential nature of 
the source outputs. We will let $p_{ij} = P\{s_i s_j\}$ denote the probability that the symbols $s_i$ 
and $s_j$ appear as successive outputs from $S$. This can be extended to higher order 
$n$-tuples, where 

$$p_{i_1 i_2 \ldots i_n} = P\{s_{i_1} s_{i_2} \ldots s_{i_n}\}$$

denotes the probability that the $n$ specified outputs occur together. We will use the 
notation $p_{j|i} = P\{s_j \mid s_i\} = p_{ij} / p_i$ for the conditional probability that symbol $s_j$ will be 
output given that the previous output was $s_i$. For sources where there are dependencies 
on more than one previous symbol, we similarly let 

$$p_{j|i_1 i_2 \ldots i_n} = \frac{p_{i_1 i_2 \ldots i_n j}}{p_{i_1 i_2 \ldots i_n}}$$

denote the probability of $s_j$, given that the $n$ preceding outputs were as specified. We can now 
define the order of a Markov source. 
Definition: Let $S$ be a source with alphabet $\{s_1, s_2, \ldots, s_q\}$ in which the 
occurrence of a source symbol $s_j$ may depend upon at most $m$ preceding 
symbols. Such a source is called a Markov source of order $m$. It is specified 
by giving the alphabet and the set of conditional probabilities 

$$p_{j|i_1 i_2 \ldots i_m} \quad \text{for } j, i_1, i_2, \ldots, i_m = 1, 2, \ldots, q.$$

3.3. Entropy of a Markov Source 

It is possible to define the concepts of information and entropy for Markov 
sources in a fashion similar to that for zero-memory sources. 

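Definition: Let $S$ be a Markov source. The conditional entropy of $S$ is defined 
by 

$$H(S \mid s_{i_1} s_{i_2} \ldots s_{i_m}) = -\sum_j p_{j|i_1 i_2 \ldots i_m} \log p_{j|i_1 i_2 \ldots i_m}.$$

The entropy of the source $S$ of order $m$ is defined by 

$$H(S) = \sum_{i_1 i_2 \ldots i_m} p_{i_1 i_2 \ldots i_m} \, H(S \mid s_{i_1} s_{i_2} \ldots s_{i_m}).$$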

There is another formula for the source entropy that is sometimes more convenient to 
use. It is obtained by rearranging the summation above and using the definition of 
conditional probability. In order to avoid notational overload, we will go through the 
rearrangement for m = 1, and simply state the result for general m. 

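$$\begin{aligned}
H(S) &= \sum_i p_i \, H(S \mid s_i) \\
     &= -\sum_i p_i \sum_j p_{j|i} \log p_{j|i} \\
     &= -\sum_{i,j} p_i \, p_{j|i} \log p_{j|i} \\
     &= -\sum_{i,j} p_{ij} \log p_{j|i}, \quad \text{by definition of conditional probability.}
\end{aligned}$$

In the case of general $m$, $H$ is computed by a sum over all $(m+1)$-tuples: 

$$H(S) = -\sum_{i_1 i_2 \ldots i_m, j} p_{i_1 i_2 \ldots i_m j} \log p_{j|i_1 i_2 \ldots i_m}.$$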

In the remainder of the thesis, we will have occasion to refer to either of the above 
formulas for $H(S)$ as uniform formulas. This name is chosen because the formulas 
uniformly use tuples of a given size. This is contrasted with formulas derived below 
that use an amount of history varying with context. 

We will be dealing with experimental data in which we will never be sure of the 
order of the source. It is possible to perform an entropy calculation on a source as if it 
were of order less than its true order, so we want to know something about how this 
entropy estimate relates to the true entropy. This question is answered by the following 
theorem. 

Theorem 3.1: Let $S$ be a Markov source of order $m$. Suppose that we assume an 
order $k < m$, and calculate an entropy value $H_k(S)$ from the uniform formula, 

$$H_k(S) = -\sum_{i_1 i_2 \ldots i_k, j} p_{i_1 i_2 \ldots i_k j} \log p_{j|i_1 i_2 \ldots i_k}.$$

The sequence of values $H_k(S)$ is monotone nonincreasing, i.e., 

$$H_k(S) \ge H_{k+1}(S) \ge H(S).$$

Note that by definition, $H_k(S) = H(S)$ for $k \ge m$. 

Proof: The proof is straightforward, and found in any information theory book. 
Theorem 3.2, proved below, is a generalization of this theorem. For an explicit 
proof of Theorem 3.1, the reader is directed to Ash, Theorem 1.4.5 [Ash65]. 

Although we will not be concerned with any such sources, it is possible to have a 
Markov source of infinite order. In that case, the entropy of the source is defined to be 
the limit of the sequence $\{H_k(S)\}$. 
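
The uniform formula and the behaviour described by Theorem 3.1 can be sketched on a string source as follows. This is a minimal illustration, assuming that probabilities are estimated by relative frequency and that the sample is treated as circular so that the k-tuple counts are exact marginals of the (k+1)-tuple counts; the function name and the sample string are hypothetical.

```python
from collections import Counter
from math import log2

def uniform_entropy(seq, k):
    """Order-k estimate: -sum p(i_1..i_k j) * log2 p(j | i_1..i_k)."""
    n = len(seq)
    ext = seq + seq[:k]                       # wrap around (assumption of this sketch)
    joint = Counter(tuple(ext[t:t + k + 1]) for t in range(n))
    context = Counter(tuple(ext[t:t + k]) for t in range(n))
    h = 0.0
    for tup, count in joint.items():
        # count / n estimates the tuple probability,
        # count / context[...] estimates the conditional probability.
        h -= (count / n) * log2(count / context[tup[:-1]])
    return h

sample = list("abaabababbabaababaabab")       # hypothetical data
estimates = [uniform_entropy(sample, k) for k in range(4)]
print(estimates)
```

With the circular convention the printed estimates never increase as k grows, which is the monotone behaviour the theorem states for the true probabilities.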

3.4. Considering Trees as the Output of a Markov Source 

Traversal Order 

Consider a source that emits nodes from trees of the form described in Chapter 2. 
One method would be to traverse the tree in preorder (father, then sons from left to 
right), emitting each nodename as it is reached. All nodes except list have a fixed 
number of sons, and the first son of list is its son count, so clearly the tree structure 
could be reconstructed from this linearized representation. Given the sequential nature 
of most programming languages, such an order is well suited for encoding program 
trees. However, for the purposes of entropy definition, any well defined algorithm that 
eventually emits all nodes of a tree is sufficient. One general scheme would use an 
arbitrary collection of previously emitted nodes to determine the order in which to visit 
the sons of a node. A process reconstructing the tree would use the same amount of 
context to determine which son receives the incoming nodes. It should be obvious that 
no traversal will cause the most recently visited nodes to be those that most influence the 
probabilities at the next node. In the next section, we will deal with this problem. 
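
The preorder linearization and its inverse can be sketched as below. The arity table is hypothetical (the real node set is the one described in Chapter 2), unknown names are treated as leaves, and the son count of a list node is assumed to count the sons that follow it.

```python
# Hypothetical arity table; names not listed here are treated as leaves.
ARITY = {"ifstmt": 3, "relN": 2, "assign": 2, "var": 3, "num": 1}

def leaf(name):
    return (name, [])

def emit(tree, out):
    """Append the nodenames of `tree` to `out` in preorder."""
    name, sons = tree
    out.append(name)
    for son in sons:
        emit(son, out)
    return out

def rebuild(stream):
    """Rebuild one tree from an iterator of nodenames produced by `emit`."""
    name = next(stream)
    if name == "list":
        count = rebuild(stream)               # first son carries the son count
        sons = [count] + [rebuild(stream) for _ in range(int(count[0]))]
    else:
        sons = [rebuild(stream) for _ in range(ARITY.get(name, 0))]
    return (name, sons)

tree = ("list", [leaf("2"),
                 ("assign", [("var", [leaf("local"), leaf("1"), leaf("16")]),
                             ("num", [leaf("0")])]),
                 ("assign", [("var", [leaf("global"), leaf("5"), leaf("16")]),
                             ("var", [leaf("local"), leaf("1"), leaf("16")])])])
stream = emit(tree, [])
rebuilt = rebuild(iter(stream))
print(stream)
```

Because every name determines how many sons follow, the decoder needs no brackets or separators in the stream.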

The mathematical models employed in analyzing Markov sources assume a 
non-terminating supply of outputs. For tree analysis, we will assume that when our 
source reaches the end of one program tree, it starts sending another one. The space of 
possible programs is clearly infinite, but we will hopefully have a large enough sample 
to draw conclusions about the microscopic nature of programs in general. 

Meaning of "Previous" for Trees 

In Markov sources of strings, the concept of "previous" means just what one might 
expect: the output just seen. When we extend the concept to trees, we will mean not the 
node just traversed, but a "nearby" node at a location defined in terms of the tree 
structure. A small example will help us to see why the standard sense of previous 
symbol is not suitable for trees. The tree in Figure 3.2 represents the IF statement of the 
program from Figure 1.1. 

[Figure 3.2 (tree diagram): the root is ifstmt, whose three sons are a relN test, an assign node, and <empty>; the subtrees below relN and assign are not recoverable from the source text.]

Figure 3.2. Tree representation of an IF statement 

Consider the position of the assign node in the tree. From a preorder traversal, the 
preceding 6 outputs from the source are var , field , Q, 0, j|^, num , and £. Intuitively, 
the appearance of assign is more influenced by the fact that its father is ifstmt, and 
that it is the second son. There might also be an influence from the presence of a test 
for inequality (relN), but the most recent nodes, the expression on the right of the relN 
test, have little influence on the output. It seems best to choose the previous node from 
the set of chronologically previous ones in terms of the tree structure. For example, in 
Chapter 4, the sample data is used to model a source of order 3 with the probability 
distributions conditioned on the elder brother, the father, and the grandfather. 
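
A tree-structured notion of "previous" can be sketched by recording, for every node, its grandfather, father, and elder brother, and then conditioning on any subset of these. The tree below is a made-up fragment in the spirit of Figure 3.2, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def contexts(tree, father=None, grandfather=None, brother=None, out=None):
    """Collect (grandfather, father, elder brother, node) rows in preorder."""
    out = [] if out is None else out
    name, sons = tree
    out.append((grandfather, father, brother, name))
    elder = None                              # a first son has a null brother
    for son in sons:
        contexts(son, name, father, elder, out)
        elder = son[0]
    return out

def tree_entropy(rows, use):
    """Uniform estimate conditioned on the positions in `use`
    (0 = grandfather, 1 = father, 2 = elder brother)."""
    joint, ctx = Counter(), Counter()
    for row in rows:
        key = tuple(row[i] for i in use)
        joint[key + (row[3],)] += 1
        ctx[key] += 1
    n = len(rows)
    return -sum(c / n * log2(c / ctx[k[:-1]]) for k, c in joint.items())

def leaf(name):
    return (name, [])

tree = ("ifstmt", [
    ("relN", [("var", [leaf("local"), leaf("1"), leaf("16")]), ("num", [leaf("0")])]),
    ("assign", [("var", [leaf("local"), leaf("2"), leaf("16")]), ("num", [leaf("13")])]),
    leaf("empty"),
])
rows = contexts(tree)
for use in [(), (1,), (1, 2), (0, 1, 2)]:
    print(use, round(tree_entropy(rows, use), 3))
```

On a sample this small the higher order estimates are meaningless as predictions; the sketch only shows how the contexts are formed.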

Experience with a Markov Model for Tree Entropy 

The formalism described above was applied to a set of sample programs in order to 
estimate the entropy of a tree node. The results are summarized in Figure 4.2. A fixed 
traversal order of preorder was used in all cases. Various candidates for previous were 
tried from the set of siblings and ancestors of a node. Estimates were made for orders 
of 1, 2, and 3. The probabilities of the various (m+1)-tuples were estimated by their 
relative frequency in the sample. There were several shortcomings in this approach that 
led to the entropy formulation described later in this chapter. 

• Any choice of previous seemed better in some contexts than in others. No set of 
previous positions was uniformly superior to all others. 

• When the order of the approximation is m, we need estimates of the probability 
of (m+1)-tuples and conditional probabilities based on m-tuples. As m gets 
large, the number of occurrences of a given m-tuple in the sample becomes quite 
small. No formal error analysis of the results was achieved, but an 
uncomfortably large portion of the entropy estimate came from (m+1)-tuples 
that occurred only a few times. 

Fortunately, the above problems can be ameliorated by recasting the entropy 
formula in a form that takes the context into consideration when determining the order. 
3.5. Non-uniform Formula for Entropy of a Markov Source 

Informal Statement of the Formalism 

The order of a Markov source, say m, is the maximum number of previous outputs 
that influence the next output. In most actual sources, there are sequences of outputs 
where the next output is influenced by fewer than m. For example, in a simple model 
for an English language source, one might consider previous words only back to the 
beginning of the current sentence. 

We wish to compute approximations $H_k(S)$ for increasing values of $k$, but our 
goals are best served by a formula that does not require extending $k$-tuples to $(k+1)$-tuples 
uniformly. Before deriving such a formula, it will be helpful to consider a small 
example. 

Example 




Figure 3.3. State diagram of a simple Markov source. 

The state diagram of Figure 3.3 defines a simple Markov source. Associate 
probabilities with each of the arrows, subject to the constraint that the sum of the 
probabilities for all arrows leaving a circle is unity. The alphabet of $S$ is {a, b}, and the 
order is 2. However, if the previous output was a, the probabilities are not affected by 
what the output was before the a. In other words, $P\{y \mid xa\} = P\{y \mid a\}$, for all $x, y \in S$. 
Clearly $H(S \mid xa) = H(S \mid a)$, for all $x \in S$. Consider the sum that defines $H(S)$, 

$$\begin{aligned}
H(S) &= \sum_{xy} p_{xy} \, H(S \mid xy) \\
     &= p_{aa} H(S \mid aa) + p_{ba} H(S \mid ba) + p_{ab} H(S \mid ab) + p_{bb} H(S \mid bb) \\
     &= p_{aa} H(S \mid a) + p_{ba} H(S \mid a) + p_{ab} H(S \mid ab) + p_{bb} H(S \mid bb)
\end{aligned}$$

which, since $p_{aa} + p_{ba} = p_a$, the reader can see is a sum over the states of the Markov 
source of the product of a state probability and the entropy given that state. Using such a 
formulation for the entropy requires us to deal with m-tuples only in those cases where all 
m previous outputs affect the value of the current output. 
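
The equality of the two sums can be checked numerically. The sketch below assigns made-up probabilities to a source of this shape (the actual arrow probabilities of Figure 3.3 are not given), finds the stationary probabilities of the 2-tuples by power iteration, and evaluates both the uniform sum over four 2-tuples and the sum over the three states a, ab, and bb.

```python
from itertools import product
from math import log2

# Hypothetical probabilities of emitting "a" given the two previous outputs.
# Contexts "aa" and "ba" share one value because only the final "a" matters.
P_A = {"aa": 0.3, "ba": 0.3, "ab": 0.6, "bb": 0.1}

def cond(y, ctx):
    """P{y | ctx} for a two-letter context."""
    return P_A[ctx] if y == "a" else 1.0 - P_A[ctx]

def h(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

# Stationary probabilities of the 2-tuples, by power iteration.
p = {x + y: 0.25 for x, y in product("ab", repeat=2)}
for _ in range(2000):
    nxt = dict.fromkeys(p, 0.0)
    for ctx, mass in p.items():
        for y in "ab":
            nxt[ctx[1] + y] += mass * cond(y, ctx)
    p = nxt

uniform = sum(p[c] * h([cond("a", c), cond("b", c)]) for c in p)
p_a = p["aa"] + p["ba"]                       # probability of state "a"
by_state = (p_a * h([cond("a", "aa"), cond("b", "aa")])
            + p["ab"] * h([cond("a", "ab"), cond("b", "ab")])
            + p["bb"] * h([cond("a", "bb"), cond("b", "bb")]))
print(uniform, by_state)
```

The two printed values agree to rounding error, while the second needs only three conditional distributions instead of four.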

Entropy Formulation for a Markov Source 

Let $S$ be a Markov source of order $m$ with source alphabet $\{s_1, s_2, \ldots, s_q\}$. In 
order to simplify our notation, we will let boldface letters denote vectors, typically of 
order $m$. If $\mathbf{j} = (j_1, j_2, \ldots, j_m)$, we shall use the abbreviation $s_{\mathbf{j}}$ for the $m$-tuple of 
alphabet characters $(s_{j_1}, s_{j_2}, \ldots, s_{j_m})$. Let $p_{\mathbf{j}}$ be the probability of occurrence of the 
$m$-tuple $s_{\mathbf{j}}$, and let $p_{i|\mathbf{j}}$ be the conditional probability of $s_i$ given that $s_{\mathbf{j}}$ occurs 
immediately before. The entropy of $S$ is defined by 

$$H(S) = \sum_{\mathbf{j}} p_{\mathbf{j}} \, H(S \mid s_{\mathbf{j}}) = -\sum_{\mathbf{j}} p_{\mathbf{j}} \sum_i p_{i|\mathbf{j}} \log p_{i|\mathbf{j}}.$$

Suppose now that dependencies of order $m$ are not needed throughout the entire range of 
source output. That is, suppose that there is some $p$-tuple $\mathbf{k}$ such that for all 
$(m-p)$-tuples $\mathbf{r}$, and all $i \in \{1, 2, \ldots, q\}$, 

$$p_{i|\mathbf{rk}} = p_{i|\mathbf{k}}.$$

The sum defining $H(S)$ can be partitioned into that portion of vectors $\mathbf{j}$ ending in $\mathbf{k}$, and 
all the rest. Consider now the portion containing $\mathbf{k}$. 

$$\begin{aligned}
\sum_{\mathbf{r}} p_{\mathbf{rk}} \, H(S \mid s_{\mathbf{rk}})
  &= -\sum_{\mathbf{r}} p_{\mathbf{rk}} \sum_i p_{i|\mathbf{rk}} \log p_{i|\mathbf{rk}} \\
  &= -\sum_{\mathbf{r}} p_{\mathbf{rk}} \sum_i p_{i|\mathbf{k}} \log p_{i|\mathbf{k}} \\
  &= -\sum_i p_{i|\mathbf{k}} \log p_{i|\mathbf{k}} \sum_{\mathbf{r}} p_{\mathbf{rk}} \\
  &= -p_{\mathbf{k}} \sum_i p_{i|\mathbf{k}} \log p_{i|\mathbf{k}}.
\end{aligned}$$

Hence we see that in calculating the entropy of a source of order $m$, the sum over all 
$m$-tuples can be replaced by a sum over vectors of length $\le m$. 

3.6. A Non-uniform Entropy Formula for Trees 

Generalization of the Non-uniform Entropy Formula 

In the derivation of the non-uniform entropy formulation for Markov sources, the 
simplification came from considering a p-tuple k, such that all m-tuples ending in k had 
the same conditional probabilities for their next output. The derivation actually made 
no use of the fact that k was contiguous, and at the end. Let us try to generalize those 
results. 

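Let $\gamma \notin S$ be a symbol allowed to match any source output. Engineers call such a variable a 
don't care condition. Let $\varphi$ be an $m$-tuple over $S \cup \{\gamma\}$, which we will call a pattern. 
Define containment as follows: If $s_{\mathbf{j}}$ is an $m$-tuple over $S$, then $\varphi \subseteq s_{\mathbf{j}}$ if for each $\varphi_i \ne \gamma$, 
$\varphi_i = s_{j_i}$. Suppose that a set of patterns $\Phi$ has been chosen with the properties: 

• The elements of $\Phi$ form a partition of $S^m$. That is, for every $m$-tuple $s_{\mathbf{j}}$, there is a 
unique $\varphi \in \Phi$ such that $\varphi \subseteq s_{\mathbf{j}}$ and $\varphi$ has a minimal number of elements equal to 
$\gamma$. 

• For all $\varphi \in \Phi$, if $m$-tuples $s_{\mathbf{j}}$ and $s_{\mathbf{k}}$ are both in the equivalence class defined by $\varphi$ 
through the above partition, 

$$p_{i|\mathbf{j}} = p_{i|\mathbf{k}}, \quad \text{for all } i.$$

Then we can restate the entropy formula as 

$$H(S) = -\sum_{\varphi \in \Phi} p_\varphi \sum_i p_{i|\varphi} \log p_{i|\varphi},$$

where $p_\varphi = \sum \{ p_{\mathbf{j}} \mid s_{\mathbf{j}}$ is in the equivalence class defined by $\varphi \}$, and $p_{i|\varphi}$ is the 
conditional probability shared by all $m$-tuples in that class. 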
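
A small sketch of the pattern formulation for strings follows. It uses "*" in place of the don't care symbol, assigns each context to the matching pattern with the fewest don't care positions, and estimates the entropy class by class from relative frequencies. The pattern set, data, and function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

WILD = "*"   # stands in for the don't care symbol

def matches(pattern, tup):
    return all(p == WILD or p == s for p, s in zip(pattern, tup))

def classify(tup, patterns):
    """Pick the matching pattern with the fewest don't care positions."""
    candidates = [p for p in patterns if matches(p, tup)]
    return min(candidates, key=lambda p: p.count(WILD))

def pattern_entropy(pairs, patterns):
    """pairs: (context m-tuple, next symbol).
    Returns -sum_phi p_phi * sum_i p(i|phi) * log2 p(i|phi)."""
    by_pattern = defaultdict(Counter)
    for ctx, sym in pairs:
        by_pattern[classify(ctx, patterns)][sym] += 1
    n = len(pairs)
    total = 0.0
    for counts in by_pattern.values():
        size = sum(counts.values())           # size / n estimates p_phi
        total -= sum(c / n * log2(c / size) for c in counts.values())
    return total

seq = "abaabababbabaababaabab"                # hypothetical data
pairs = [((seq[t], seq[t + 1]), seq[t + 2]) for t in range(len(seq) - 2)]
patterns = [(WILD, "a"), ("a", "b"), ("b", "b")]
print(pattern_entropy(pairs, [(WILD, WILD)]), pattern_entropy(pairs, patterns))
```

The single all-wildcard pattern reproduces the zero-memory estimate, and the three-pattern set conditions on two previous symbols only where the most recent symbol is b.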

Tree Formula 

Suppose that $S$ is a source of tree nodes with a well defined traversal order. Let a 
pattern be a partial subtree, possibly with don't care elements, denoted *. The pattern 
has a node, denoted @, which is the last node of the pattern in traversal order. We will 
be interested in the probability distribution of nodenames that occur in this @ position, 
given that the rest of the pattern is matched. The location of the assign node in 
Figure 3.2 might correspond to the pattern 

        ifstmt 
        /    \ 
     relN     @ 

meaning the second son of an ifstmt whose first son is a relN test. Asterisks for the 
third son of ifstmt and the sons of relN are omitted. Suppose that we can find a set 
$\Pi$ of patterns, such that for each $\pi \in \Pi$, the conditional probabilities of the values of @ 
are independent of any previous context other than that specified in $\pi$. Suppose further 
that for each node in a possible tree, there is some unique $\pi \in \Pi$, such that the node is in 
the @ position of $\pi$. If the patterns are allowed to overlap, we will require an effective 
algorithm for determining which pattern to associate with a given tree position. By 
reasoning similar to that of the previous section, 

$$H(S) = -\sum_{\pi \in \Pi} p_\pi \sum_i p_{i|\pi} \log p_{i|\pi}.$$

In order to estimate $H$, we need to find a traversal order, a suitable pattern set $\Pi$, and 
we need to estimate the probabilities $p_\pi$ and $p_{i|\pi}$. The comments above about 
overlapping patterns might lead one to the conclusion that such situations are to 
be avoided; on the contrary, the pattern refinement procedure discussed below generates 
large numbers of nondisjoint patterns. 
3.7. Pattern Refinement 

In Section 4.4 we will describe a process called pattern refinement, in which the 
nodes matching a pattern $\pi$ are partitioned into two sets: those also matching a new 
larger pattern $\nu$, and the remaining ones. This section contains a proof that pattern 
refinement always leads to an improvement in the estimate unless $\nu$ has the same 
conditional probabilities as $\pi$. It is not even necessary for the refining pattern $\nu$ to have 
a lower entropy for its matching nodes than the original. In order to prove the theorem, 
a few elementary results from probability theory and information theory will be needed. 

Lemma 3.1. Let $A$, $B$, and $C$ be events. Then 

$$P\{A \text{ and } B \mid C\} = P\{A \mid B \text{ and } C\} \cdot P\{B \mid C\}.$$

Proof: The result follows by two applications of the definition 

$$P\{X \mid Y\} = P\{X \text{ and } Y\} / P\{Y\},$$

with $Y = B \text{ and } C$, then $Y = C$. See [Feller68], page 116. ∎ 

Lemma 3.2. Let $\{p_1, p_2, \ldots, p_n\}$ and $\{q_1, q_2, \ldots, q_n\}$ be sets of probabilities. That is, 
$p_i \ge 0$ and $q_i \ge 0$, for all $i$, and 

$$\sum_i p_i = \sum_i q_i = 1.$$

Then the formula 

$$-\sum_i p_i \log p_i \le -\sum_i p_i \log q_i$$

holds with equality if and only if $p_i = q_i$, for all $i$. 

Proof: This is a well known result, following from the inequality $\ln x \le x - 1$, with 
equality if and only if $x = 1$. See, for example, [Abramson63], formula 2-8b. ∎ 

Theorem 3.2. Let $\pi$ be a pattern, and $\nu$ a pattern containing $\pi$. Let $\pi'$ be the 
pattern defined by "matching $\pi$, but not matching $\nu$". Then the component of the 
entropy estimate contributed by the two disjoint patterns $\nu$ and $\pi'$ does not exceed 
the component of the pattern $\pi$ that they replace. In other words, 

$$p_\pi H(S \mid \pi) \ge p_\nu H(S \mid \nu) + p_{\pi'} H(S \mid \pi').$$

Furthermore, equality holds if and only if $P\{s_i \mid \nu\} = P\{s_i \mid \pi\}$ for all $i$. 

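Proof: The general approach of the proof is to divide the probabilities of various 
nodes into a $\nu$ component and a $\pi'$ component. There will be some nodes for which 
one of the components will be zero. We will adopt the convention that $0 \log 0 = 0$ 
in order to simplify our notation. 

Let $\alpha = P\{\nu \mid \pi\}$, $\beta = P\{\pi' \mid \pi\}$. Clearly $\alpha + \beta = 1$. 

For each node $s_i \in S$, let $p_i = P\{s_i \mid \pi\}$. Decompose $p_i$ as follows: 
Let $p_i = q_i + r_i$, where $q_i = P\{s_i \text{ and } \nu \mid \pi\}$ and $r_i = P\{s_i \text{ and } \pi' \mid \pi\}$. 

Clearly $\sum_i q_i = \alpha$, and $\sum_i r_i = \beta$. Also, by Lemma 3.1 and the fact that matching $\nu$ 
or $\pi'$ implies matching $\pi$, 

$$P\{s_i \mid \nu\} = \frac{q_i}{\alpha}, \qquad P\{s_i \mid \pi'\} = \frac{r_i}{\beta}.$$

Consider now the contribution of $\pi$ to the entropy estimate. 

$$\begin{aligned}
p_\pi H(S \mid \pi) &= -p_\pi \sum_i p_i \log p_i \\
  &= -\alpha p_\pi \sum_i \frac{q_i}{\alpha} \log p_i - \beta p_\pi \sum_i \frac{r_i}{\beta} \log p_i \\
  &= -p_\nu \sum_i \frac{q_i}{\alpha} \log p_i - p_{\pi'} \sum_i \frac{r_i}{\beta} \log p_i \\
  &\ge -p_\nu \sum_i \frac{q_i}{\alpha} \log \frac{q_i}{\alpha} - p_{\pi'} \sum_i \frac{r_i}{\beta} \log \frac{r_i}{\beta} \\
  &= p_\nu H(S \mid \nu) + p_{\pi'} H(S \mid \pi').
\end{aligned}$$

The third line uses $p_\nu = \alpha p_\pi$ and $p_{\pi'} = \beta p_\pi$, and the inequality is Lemma 3.2 
applied to each sum. From Lemma 3.2, equality holds if and only if $q_i / \alpha = r_i / \beta = p_i$, for all $i$. 
Since $\beta = 1 - \alpha$ and $q_i + r_i = p_i$, $q_i / \alpha = p_i$ implies $r_i / \beta = p_i$. ∎ 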

Although Theorem 3.2 is stated in terms of tree patterns, the concept is equally 
valid for any non-uniform entropy formula where a partition of the possible source 
output configurations is refined by splitting one of the equivalence classes into two 
disjoint pieces. 
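
The inequality of Theorem 3.2 can be checked on a small numeric case. The counts below are invented for illustration: they give the number of times each nodename was seen in the @ position of a pattern pi, and how many of those occurrences also match a larger pattern nu. Function names are hypothetical.

```python
from math import log2

def weighted_entropy(counts, total):
    """Estimate of p_pattern * H(S | pattern) from counts under one pattern."""
    size = sum(counts.values())
    return -sum(c / total * log2(c / size) for c in counts.values() if c > 0)

def refine(pi_counts, nu_counts):
    """Split the nodes matching pi into those matching nu and the remainder pi'."""
    rest = {s: c - nu_counts.get(s, 0) for s, c in pi_counts.items()}
    return nu_counts, rest

# Hypothetical counts of nodenames seen in the @ position.
pi = {"var": 60, "num": 30, "call": 10}
nu = {"var": 10, "num": 25}
total_nodes = 1000

nu_part, rest = refine(pi, nu)
before = weighted_entropy(pi, total_nodes)
after = weighted_entropy(nu_part, total_nodes) + weighted_entropy(rest, total_nodes)
print(before, after)
```

Here the refined contribution is strictly smaller. If the counts under nu are chosen proportional to those under pi, the two values coincide, which is the equality case of the theorem.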

This chapter has contained most of the formal mathematics of the thesis; Chapter 
5, on error analysis, also contains a few results. 
4.1. Sample Description 

Obtaining a Representative Sample 

When it was decided to apply the framework of information theory to the 
empirical study of programs, the Mesa language and its compiler were still under 
development, so there were very few source programs. Thus, the analysis programs were 
being developed as the language was maturing and as a collection of source programs 
was being written. Writing software that depends upon the internal workings of a 
compiler under development by others is a frustrating task, but eventually enough 
programs existed to provide a suitable sample for analysis. At this point the compiler, 
all support routines, and the entire set of available source programs were saved away on 
tape and this frozen universe was used for analysis and for refinement of the analysis 
tools. The sample lasted for almost 5 months, and for several iterations of the analysis 
programs. These were the Pass 3 trees described in Chapter 2. When Pass 4 trees were 
chosen for study, the changing language had stabilized, so a new larger sample was made 
from essentially all existing source programs. 

The compiler and runtime system are all written in Mesa itself. Since it is a young 
language, these comprise the majority of the programs written. The language is intended 
for system implementation, so these programs are probably representative of the ones 
which will be written in the future. The sample was 114 programs, of which 
approximately half were the multipass compiler. The remainder were runtime support 
and utility software. 
Sample Size 

The physical characteristics of the set of sample programs are enumerated below. 
Since length of identifiers varies with programmers, a count of the tokens as returned by 
the lexical scanner of the compiler was also taken as a measure of source program size. 
Tokens included identifiers, operators, numbers, reserved words, etc. The count of lines 
was made as an afterthought and in some cases represents a count of lines in a later 
version of a given program. 

Number of Programs: 114. 

Lines of Source: approximately 37,000. 

Characters of Source (excluding comments): 992,030. 

Tokens of Source: 199,406. 

The programs were written by 5 programmers, all of considerable experience. The size 
of the programs varied from small collections of shared definitions to large compiler 
modules. The smallest was 165 characters (28 tokens), and the largest was 31,696 
characters (7119 tokens). The mean number of characters per program was 8090, with a 
standard deviation of 7150, indicating a large variability in program size. The following 
information is irrelevant to our further discussion, but is included for interest. 

Average length of identifier (excluding reserved): 8.12 chars 
Average proportion of comments: 7.9% 

The programs are low on comments, but the programmers compensate by choosing long 
descriptive names for variables. The mean of 8.12 characters is quite large when one 
considers the relatively frequent use of short variables such as i and j. Some definitions 
modules had an average identifier length in excess of 13. These modules also had a 
higher percentage of comments, sometimes in the range of 30 to 40%. 

Existing Encodings of the Sample 

Our goal is to see how little code is necessary to represent a program. It is useful 
to have a benchmark against which to measure our ability. Since code compression 
comes often at the cost of increased complexity in the encoding and decoding processes, 
knowing the cost of a naive encoding would be helpful. Certainly the output of the 
existing compiler can serve as an upper bound on the amount of space needed to encode 
a program, but this does not reflect a naive encoding since an effort was made to have 
the compiler produce compact object code. The other representation of the programs 
that we have is the trees described in Chapter 2. The actual representation of these trees 
used by the analysis routines was designed for ease of incremental updating of various 
data bases, so the size of the set of trees is not the correct measure. We will see in the 
next section that a naive, but compact encoding of these trees would require at least 10 
bits per node. 

Code generated by the compiler: 944,512 bits. 

Tree representation: 296,895 nodes × 10 bits per node = 2,968,950 bits. 

4.2. Applying Uniform Order Markov Model 

Zero-memory Source Model 

The simplest model for encoding the trees is to assume that the various tree nodes 
are independent. Each node $i$ is assumed to occur with probability $p_i$. From the 
definitions of Chapter 3, we can calculate the entropy $H_0(S)$ by the formula 

$$H_0(S) = -\sum_i p_i \log p_i.$$

This model provides some insight into the use of the language. Of the 296,895 
nodes in the sample, there were 1037 distinct nodenames. While this number seems 
large, recall that it also includes programmer-specified numbers and strings. There were 
556 nodes which occurred only once, 399 of them string literals. We will consider the 
effects of literals in the next section. The entropy estimate was 4.75 bits per node, quite 
a bit smaller than the slightly over 10 bits needed to represent 1037 symbols of equal 
frequency. 

Figure 4.1 shows the 35 most frequent nodes together with their contribution to the 
entropy calculation. The most frequent nodes are those that were invented to put the 
information about variables into the tree. No meaningful conclusions can be drawn 
about use of numerical literals, since most of the 0's, 1's, and 16's are sons of var nodes, 
the node which was invented to put information from the symbol table back into the 
trees for analysis. We can see that the static usage of variables by frame location is 
ordered: local, global, field, and entry. The most popular statement nodes are 
call, assign, ifstmt, and return. Some of the list nodes refer to compound 
statements. The most popular expression nodes are: var, call, dot (field selection via 
pointer), dollar (field selection from variable), plus, relE (test for equality), and 
uparrow (obtaining value of cell from pointer to the cell). The node call can be 
either a statement or an expression; we cannot tell from these zero-memory statistics 
which is which. The first 35 nodes (3% of the total) accounted for 93% of the node 
usage and 84% of the total entropy estimate. 



nodename    count      p    -p log p   Σ -p log p 

var         54280    .183     .448       .448 
16          40289    .136     .391       .839 
local       26481    .089     .311      1.150 
empty       17580    .059     .241      1.392 
global      13917    .047     .207      1.599 

1           12279    .041     .190      1.789 
num         11293    .038     .179      1.968 
2           10220    .034     .167      2.135 
field        9230    .031     .156      2.291 
list         7744    .026     .137      2.428 

14           7425    .025     .133      2.561 
call         7054    .024     .128      2.690 
assign       6616    .022     .122      2.812 
dot          5826    .020     .111      2.923 
3            5726    .019     .110      3.033 

plus         5428    .018     .106      3.139 
4            4039    .014     .084      3.223 
dollar       3309    .011     .072      3.295 
entry        2770    .009     .063      3.358 
relE         2686    .009     .061      3.420 

S            2348    .008     .055      3.475 
5            2245    .008     .053      3.528 
ifstmt       2189    .007     .052      3.580 
return       1810    .006     .045      3.625 
body         1683    .006     .042      3.667 

6            1513    .005     .039      3.706 
32           1385    .005     .036      3.742 
uparrow      1341    .005     .035      3.778 
7            1270    .004     .034      3.811 
item         1264    .004     .034      3.845 

9            1214    .004     .032      3.877 
15           1178    .004     .032      3.909 
11            929    .003     .026      3.935 
in            928    .003     .026      3.961 

Figure 4.1. Excerpt from the zero memory entropy calculation. 
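
The columns of Figure 4.1 can be recomputed from the counts. The sketch below does this for the five most frequent nodes, using the sample total of 296,895 nodes and base-2 logarithms; the function name is made up for the example.

```python
from math import log2

TOTAL_NODES = 296895
top_counts = [("var", 54280), ("16", 40289), ("local", 26481),
              ("empty", 17580), ("global", 13917)]

def rows(counts, total):
    """Return (name, count, p, -p log p, running sum) with values rounded to 3 places."""
    running = 0.0
    out = []
    for name, count in counts:
        p = count / total
        term = -p * log2(p)
        running += term                       # accumulate before rounding
        out.append((name, count, round(p, 3), round(term, 3), round(running, 3)))
    return out

table = rows(top_counts, TOTAL_NODES)
for row in table:
    print(row)
```

The running sum is accumulated from unrounded terms, which is why the fourth entry comes out as 1.392 rather than the 1.391 obtained by adding the rounded column.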
Dealing with Programmer-defined Literals 

In the Markov studies performed for this thesis, certain simplifying assumptions 
were made about literal constants. It was assumed that the numbers and string constants 
seen in the sample defined the range of all possible values for numbers and string 
constants. For numbers this was not too bad an assumption. A vast majority of the 
numbers fell into a small range, making the entropy estimate for those outside the range 
unimportant to the total estimate for numbers. With strings, however, this was not the 
case. Of the 449 literal strings in the sample, the most frequent occurred 4 times, and 
only 30 occurred more than once. In Chapter 5, on error analysis, a more conservative 
estimate of entropy for string constants is factored back into the final estimates in the 
more general entropy model (they increase the total estimate by 4%). 

Uniform Markov Estimates 

In Chapter 3, we saw that the standard definition of "previous" for Markov sources 
was not the proper definition to use for trees; the notion should be defined in terms of 
the tree structure. Figure 4.2 shows the results of applying a Markov source 
approximation of order m to the sample for various values of m, and for various 
definitions of "previous." The formula used to compute the entropy was 

$$H(S) = -\sum_{\mathbf{k}, j} p_{\mathbf{k}j} \log p_{j|\mathbf{k}},$$

where $\mathbf{k}$ denotes an $m$-tuple over the source alphabet. In order to obtain a number, one 
must approximate $p_{\mathbf{k}j}$ for all $(m+1)$-tuples which occur in the sample. For those nodes 
with a low frequency of occurrence, the relative error in the approximation is likely to 
be larger. In Figure 4.2, $f$ is the frequency of the $(m+1)$-tuple in the sample. The last 
two columns show the percentage of $(m+1)$-tuples which occurred fewer than 6 times, 
and their contribution to the total entropy estimate. In the table, brother refers to the 
immediately preceding brother in the tree. For a node that is a first son, the value of its 
brother is null. 
order   definition                      H          # of distinct    % tuples      % H from 
m       of previous                     estimate   (m+1)-tuples     with f < 6    f < 6 

0       (none)                          4.752       1037            72             1.4 
1       father                          3.180       1945            60             2.3 
1       brother                         3.158       2652            61             3.4 
2       grandfather, father             2.802       6507            66             6.4 
2       father, brother                 2.053       4002            62             4.9 
3       grandfather, father, brother    1.662      11954            61            12.2 

Figure 4.2. Results of uniform Markov estimates on trees. 
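
The last two columns of Figure 4.2 can be computed along the following lines. This is a sketch with invented data: each sample is a pair of a context tuple and the node seen in that context, and the threshold of 6 follows the figure. The function name is hypothetical.

```python
from collections import Counter
from math import log2

def low_frequency_share(samples, threshold=6):
    """samples: list of (context tuple, node).
    Returns (H, % of distinct tuples with f < threshold, % of H from those tuples)."""
    joint = Counter(samples)
    ctx = Counter(c for c, _ in samples)
    n = len(samples)
    # Contribution of each (m+1)-tuple to the uniform entropy formula.
    terms = {k: -(f / n) * log2(f / ctx[k[0]]) for k, f in joint.items()}
    h = sum(terms.values())
    rare = [k for k, f in joint.items() if f < threshold]
    pct_tuples = 100.0 * len(rare) / len(joint)
    pct_h = 100.0 * sum(terms[k] for k in rare) / h if h > 0 else 0.0
    return h, pct_tuples, pct_h

# Hypothetical sample: the father is the context, the second field is the node.
samples = ([(("assign",), "var")] * 10
           + [(("assign",), "num")] * 2
           + [(("dot",), "var")] * 6)
h, pct_tuples, pct_h = low_frequency_share(samples)
print(round(h, 3), round(pct_tuples, 1), round(pct_h, 1))
```

In this toy sample one of the three distinct tuples is rare, yet it supplies most of the estimate, which is the kind of imbalance the last column of the figure is meant to expose.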

The results for the zero memory approximation have also been included in Figure 
4.2 for comparison. Another interesting number is the average number of bits of object 
code per node as generated by the existing compiler. There were 296,895 nodes in the 
trees and 944,512 bits of code generated, yielding an average of 3.18 bits per node. This 
shows that the Markov estimates and the current compiler are of similar magnitude. It 
also reflects the considerable thought that went into the design of the current object code 
in order to keep size to a minimum. 

As the order of the Markov source approximation increases, the entropy estimate 
decreases. At the same time, however, the amount of the estimate which comes from 
low frequency terms increases. We also see that the choice of previous as well as the 
order affects the results. Figure 4.3 shows the 35 most frequent 3-tuples in the second 
order estimate with father and brother as previous nodes. 

Several things can be learned by examining Figure 4.3. The most frequent nodes 
are not necessarily the ones which contribute the most to the entropy estimate. If the 
conditional probability $p_{j|\mathbf{k}}$ is close to unity, $\log p_{j|\mathbf{k}}$ is small, offsetting the large $p_{\mathbf{k}j}$ 
value. If the 3-tuples were ordered by contribution to H instead of frequency, the total 
for the top 35 3-tuples would be .941 instead of .723. The 3-tuples (dot, var, var) 
and (dot, plus, var) each contribute zero to the entropy. This is true not because of 
the value of the brother (var and plus, respectively), but because dot has 2 sons, the 
second of which is always var. In 40% of the cases shown, the brother is null, telling us 
only that the node j is a first son. The other two entries with an entropy contribution 
of 0 result from the "boundary conditions" on the traversal: all trees begin 
body[list[ 
Figure 4.3. Excerpt from Markov source estimate of order 2. 

There are definitely dependencies of the nodes upon the values of the neighbors 
within the tree. The high percentage of entropy coming from m-tuples of low frequency 
in the models above suggests that simple uniform attempts to capture the dependencies 
are not adequate. Below we will apply the non-uniform formulation for entropy 
discussed in Chapter 3. 

4.3. Estimating the Entropy by the Non-uniform Entropy Formula 

General Methodology 

We wish to choose a set of patterns so that every node in a tree can be associated 
with a unique pattern, depending on nodes which appeared earlier in a traversal of the 
tree. For our investigations, we shall always be using preorder traversal. One can think 
of the patterns as defining a set of states of an encoding process. The ultimate goal is 
to have a set of patterns such that all previous context that significantly affects the value 
of a node is contained in the pattern associated with that node. If our current set of 
patterns does not have that property, our model is one which assumes a smaller local 
order than the actual local order of the source. This corresponds to an overestimate of 
the true entropy, but for practical purposes, we will not have a sufficient sample set to 
find all dependencies accurately. We will, however, have a bound for each pattern $\pi$ 
and we can use it to decide whether it is useful to pursue larger patterns containing $\pi$. 

Description of the Patterns Used 

Recall from Chapter 3 that a pattern specifies a context reached by a partial 
traversal of the tree. The pattern has a distinguished node, denoted @, which is the next 
node to be visited in the traversal. As long as each point in the tree is associated with a 
unique pattern, we need not be limited to simple equal matches in the definition of our 
patterns. For example, one could have the pattern 

    {assign|assignx} 
       /        \ 
      *         var 
               /   \ 
           local    @ 

which selects the address of local variables appearing as the right side of either 
assignment statements or assignment expressions. The patterns need not be disjoint. If 
$\pi$ and $\nu$ are both patterns, with $\pi \subset \nu$, then there are some places within trees that match 
both patterns. In fact, all places matching $\nu$ also match $\pi$. We do not wish to associate 
a given node with more than one pattern, so we will think of the pattern $\pi$ as matching 
all situations which do not also match $\nu$. Since a list node can have an arbitrary 
number of sons, we need some way to deal with this freedom in our choice of patterns. 
This is accomplished by allowing a node in the pattern to be preceded by an arbitrary 
number of brothers. Thus the pattern 

       list 
     /  |  \ 
    *  ...  @ 

selects a node which is a son of list, but not the first son. 

Notation 

We will be talking about many patterns in the remainder of the chapter. While the 
tree diagrams are helpful, they are not very compact. We will adopt a "function-like" 
notation to specify patterns, which is clearly equivalent to the trees. In this notation, the 
above patterns would be written: 

{assign|assignx}[*,var[local,@]] 
list[*,...@]. 

Initial Pattern Set 

In order to "grow" a set of patterns, we must choose a starting point. The most 
conservative approach would begin with a zero-memory model, i.e., the single pattern 
"@". But nodes seem to depend upon at least their father, so we shall start with a first 
order model. Such a set of patterns can be generated quickly and easily by a 
modification to the programs which produced the uniform Markov estimates. 

The starting pattern set specifies for each node in the tree its father and the son 
position which the node occupies. For example, the node ifstmt has 3 sons, so the 
initial pattern set contains the three patterns: 

ifstmt[@], ifstmt[*,@], and ifstmt[*,*,@]. 

The variable length list node requires special handling. The first son is a count of the 
remaining sons, and should be treated differently from the later sons. In the initial 
pattern set, all non-initial sons of list are considered to be in the same pattern. 
Hence, there are the two patterns: 

list[@] and list[*,...@]. 
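
The construction of the initial pattern set can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis program used for the thesis: the `Node` class, the function names, the "@" label given to a root, and the small example tree (whose leaves are left without sons for brevity) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    sons: list = field(default_factory=list)


def son_pattern(father, position):
    """Initial pattern for the son at `position` (0-based) of a node named `father`."""
    if father == "list" and position > 0:
        return "list[*,...@]"          # all non-initial sons of list share one pattern
    return father + "[" + "*," * position + "@]"


def initial_patterns(root):
    """Return (pattern, nodename) pairs in preorder."""
    pairs = []

    def visit(node, pattern):
        pairs.append((pattern, node.name))
        for position, son in enumerate(node.sons):
            visit(son, son_pattern(node.name, position))

    visit(root, "@")                   # assumption: a root has no father, so it gets "@"
    return pairs


# Hypothetical, simplified tree: the first son of list is its count of remaining sons.
tree = Node("ifstmt", [
    Node("relN"),
    Node("list", [Node("2"), Node("assign"), Node("call")]),
    Node("<empty>"),
])
pairs = initial_patterns(tree)
for pattern, name in pairs:
    print(f"{pattern:16} {name}")
```

Each node receives exactly one pattern, determined only by its father and its son position, which is the property the first-order model requires.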

Entropy Estimate from the Initial Pattern Set 

For each pattern $\pi$, there is the set of nodes in the sample associated with it. These 
nodes, considered as the output of a zero memory source, have an entropy $H(S \mid \pi)$. If 
the probability that a node is associated with pattern $\pi$ is $p_\pi$, then the pattern has a 
contribution to the total entropy estimate of $p_\pi H(S \mid \pi)$. Experimentally, we estimate $p_\pi$ 
and $H(S \mid \pi)$ by assuming the probabilities of patterns and nodes to be their relative 
frequency in the sample. 

(2) var[*,@] (address of variables) 
    f = 40289  H = 3.680, p = … 
    154 cases (11 shown) 
    …         12914  .32 
    1          6658  .17 
    2          4014  .10 
    …          3131  .08 

(5) list[*,...@] (element …) 
    f = 22481  H = 3.763, p = … 
    72 cases (11 shown) 
    assign     5085  .23 
    call       3573  .16 
    num        2623  .12 
    var        2420  .11 

(4) var[*,*,*,@] (length of variables) 
    f = 40289  H = 2.099, p = .136, H contr. = … 
    67 cases (11 shown) 
    16            …    …      …          611  .02      11           357   … 
    …             …    …      …          414  .01      3            334   … 
    …             …    …      …          381  .01      <<others>>  2195   … 

(1) var[@] (frame of variables) 
    f = 40289  H = 1.762, p = .136, H contr. = .239, cumul. H = 1.308 
    4 cases 
    local     17580  .44      field      7744  .19 
    global    12279  .30      entry      2686  .07 

(6) num[@] (programmer specified numbers) 
    f = 10220  H = 5.132, p = .034, H contr. = .177, cumul. H = 1.485 
    408 cases (11 shown) 
    …          2254  .22 
    1          1786  .17 
    2           748  .07 
    …           537  .05 

(12) assign[*,@] (right hand side of assignment statements) 
    f = 5826  H = 3.591, p = … 
    42 cases (11 shown) 
    call       1288  .22 

(9) call[*,@] (procedure call actual parameters) 
    f = 6616  H = 2.961, p = .022, H contr. = … 
    27 cases (11 shown) 
    list       1816  .27      dot         571   … 
    var        1739  .26      dollar      209   … 
    <empty>     760  .11      str         204   … 

(7) list[@] (length of lists) 
    f = 7425  H = 2.476, p = .025, H contr. = … 
    38 cases (11 shown) 
    2          3599  .48 
    3          1139  .15 
    1          1038  .14 
    4           591  .08 

(3) var[*,*,@] (bit offset of variables) 
    f = 40289  H = .336, p = .136, H contr. = … 
    16 cases (3 shown) 
    0         38841  .96 
    3             …    … 

Figure 4.4. Excerpt from initial pattern entropy calculation. 

In the initial pattern set, there are 154 patterns. The most frequent one occurs 
40,289 times, and the least frequent one 2 times. There are 12 patterns which occur 
fewer than 25 times. The estimated entropy per node is 2.058 bits. This number is very 
close to the uniform Markov model of order 2, with father and brother as the previous 
nodes. This is not surprising, since we saw above that the brother in that model was 
often simply specifying the son position. Figure 4.4 contains a portion of the output 
from the entropy calculation, showing patterns, their contribution to the total, and the 
distribution of nodes associated with those patterns. For purposes of this figure, only 
the frequently occurring nodes are shown for a pattern. See Appendix C for a more 
complete listing. 

A few words of explanation are needed for Figure 4.4. The number in parentheses 
is the pattern number. For the initial pattern set, they are assigned in order of 
decreasing frequency. As patterns are added to the set, they are given numbers 
sequentially. The description in parentheses tells what source constructs give rise to trees 
matching the pattern. The H figure given is the entropy estimate for the nodes 
associated with the pattern. The patterns are sorted by H contr, the product of p and 
H. The cumul. H value is the running total of H contr. for the patterns. The nodes 
matching the pattern are shown in three columns, giving the nodename, its frequency, 
and its relative frequency. 
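
The quantities printed for each pattern can be recomputed from its node counts. The following Python sketch does this for pattern 1, var[@], using the four counts listed in Figure 4.4. The function name is illustrative, and the total node count of 296895 is taken from the figure quoted in Section 4.5, on the assumption that it also applies to the initial pattern set.

```python
import math


def entropy(counts):
    """Entropy in bits of the relative frequencies of a {nodename: count} table."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c > 0)


TOTAL_NODES = 296895   # assumption: sample size as quoted in Section 4.5

# Pattern 1, var[@] (frame of variables), counts from Figure 4.4.
var_frame = {"local": 17580, "global": 12279, "field": 7744, "entry": 2686}

f = sum(var_frame.values())    # frequency of the pattern
H = entropy(var_frame)         # entropy of the nodes associated with the pattern
p = f / TOTAL_NODES            # probability of the pattern
contr = p * H                  # contribution to the total estimate ("H contr.")

print(f"f = {f}  H = {H:.3f}, p = {p:.3f}, H contr. = {contr:.3f}")
```

Under these assumptions the values agree with the entry for pattern 1 to three decimal places. Summing `contr` over all patterns gives the entropy estimate per node.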

In the uniform Markov source models, it was difficult to relate the dependencies to 
the actual statements which appear in programs. The patterns show the correspondence 
in a more natural way. One should remember, however, that these are static counts, 
made on the basis of occurrence in programs, not dynamic counts taken of nodes as they 
occur during execution of a program. 

From pattern 12, we can see that 32% of the right-hand-sides of assignments are 
simple, with the value assigned either a simple variable or a number. Procedure calls 
account for 22% of the right-hand-sides, and field selections (dot and dollar) account 
for 14%. The 5% figure for <empty> occurs in extract statements, e.g., [v1,v2] ← f[]. 
The compiler parses the left side of such statements into a list of empty assignments. 
The count for simple variables also includes subscripted variables with constant indices 
that can be "folded" into a simple variable. 

In pattern 9, we see that only 27% of all procedure calls take more than 1 
argument, and that 11% of the procedure calls are parameterless. Of the procedure calls 
with only one argument, 60% take a simple variable or a number. 

Pattern 7 matches the length specification for list nodes. Short lists predominate. 
Even without other encoding, one could save space by defining new nodes list2, 
list3, etc. to handle the common cases and using just about any reasonable scheme for 
longer lists. 

Pattern 3 matches the bit offset of variables. It is overwhelmingly zero. We will 
later see that only a small amount of further context suffices to separate all the nonzero 
instances. 

Structuring the Sample to Facilitate Matching 

A data structure was developed for the analysis programs that greatly simplifies the 
pattern match procedure, the refinement procedure, and the transformation procedures 
described below. With the first pattern matching facilities used, a problem developed; as 
the size of existing patterns grew, the computation required for finding all instances of a 
set of refinement patterns took 30 minutes to an hour of CPU time; and it was clear that 
some of the extensions of the analysis software could not be run at all. The solution to 
these problems is quite simple: in the sample trees, we store a value associated with each 
node that identifies which pattern that node matches. Since the analysis programs run 
on a 36 bit machine, there was enough unused room in a node to store a pattern 
sequence number without increasing the storage occupied by the sample. This is 
essentially the data processing technique of inverting a data base. Figure 4.5 shows the 
ifstmt tree from Figure 3.2, with the number of the pattern in the initial pattern set 
that each node matches. 

[5] ifstmt 
    [21] relN 
        [36] dot 
            [13] var 
                [1] local    [2] …    [3] …    [4] 16 
            [14] var 
                [1] field    [2] …    [3] …    [4] 16 
        [37] num 
            [6] 5 
    [22] assign 
        [11] var 
            [1] global   [2] …    [3] …    [4] 1 
        [12] relE 
            [19] var 
                [1] local    [2] 2    [3] …    [4] 16 
            [20] num 
                [6] 13 
    [23] <empty> 

Figure 4.5. Parse tree showing pattern sequence numbers. 

For the numbers to be useful, any procedure that places a node in a new pattern 
must update the pattern sequence field in the tree. This consistency maintenance is far 
simpler than general pattern matching. By writing all such changes to the sample onto a 
file, it is also possible to "undo" the effects of a new pattern. 
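
A minimal Python sketch of this bookkeeping follows. It is an assumption-laden illustration: the class and method names are hypothetical, and an in-memory list stands in for the change file described above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    pattern: int                       # sequence number of the pattern this node matches
    sons: list = field(default_factory=list)


class Sample:
    """Keeps a log of pattern changes so that a new pattern can be undone."""

    def __init__(self):
        self.log = []                  # (node, previous pattern number), newest last

    def set_pattern(self, node, new_pattern):
        self.log.append((node, node.pattern))
        node.pattern = new_pattern

    def undo(self, n_changes):
        """Revert the most recent n_changes pattern assignments."""
        for _ in range(n_changes):
            node, previous = self.log.pop()
            node.pattern = previous


length_node = Node("2", pattern=7)     # a list length currently matching pattern 7
sample = Sample()
sample.set_pattern(length_node, 219)   # refinement moves it to pattern 219
moved_to = length_node.pattern
sample.undo(1)
restored_to = length_node.pattern
print(moved_to, restored_to)
```

Because the pattern number travels with the node, a later refinement only has to test nodes already marked with the pattern being refined, rather than rematching every pattern from scratch.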

4.4. Improving the Pattern Set 

Finding Larger Patterns 

The output from the entropy calculation gives the contribution of each pattern to 
the total entropy estimate. We can lower the estimate by finding, for a given pattern, 
situations where a larger context produces a distribution of nodes with a smaller 
entropy. For example, we see from pattern 7, Figure 4.4, that 48% of all lists are of 
length 2, and the entropy for the distribution of lengths is 2.476 bits. If however, we 
single out those lists which are parameter lists of call statements (approximately a 
quarter of all lists), 77% are of length 2, and the distribution of lengths has an entropy 
of only 1.085 bits. 

Several schemes were used for finding such larger patterns. A more detailed 
description of the program finally used is given in Appendix B. Below is a brief listing 
of the features of the pattern match facility. 

1. The entire collection of trees is traversed in preorder. Several auxiliary data 
structures keep a record of ancestors and siblings. 

2. For an existing pattern, further restrictions may be made by specifying a list of 
predecessors to have given nodenames or son displacements. 

3. For a pattern, perhaps restricted as in 2, the user can specify a collection of 
historical information (nodenames or son displacements of predecessors) on 
which he wishes to obtain statistics; they are automatically obtained for the 
nodes in the "@" position of the pattern. 

4. One may also specify a collection of pairs of the items selected in item 3 above. 
For example, grandfather and "@" node. 

5. As each node is encountered which matches the (possibly restricted) pattern, all 
requested information is stored in a hash table. 

6. The output of the match gives the distribution of nodes or values as requested in 
item 3. For a pair of items ($\alpha$, $\beta$), the distribution of items $\beta$ is given for each 
value of $\alpha$. 

For patterns only one node larger than existing ones, the pair statistics from item 6 
point out potential entropy reducers. For more complicated patterns, one can restrict an 
existing pattern by item 2 and iterate the match process. Other features of the match 
procedure allow several patterns to be investigated at once, allow the same statistics to be 
taken on a set of existing patterns, and allow several independent sets of statistics to be 
taken on a given pattern during a single traversal of the sample set. 
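
The pair statistics of items 4 to 6 can be sketched as follows. This Python fragment is an illustration only: the function names and the observations are hypothetical, and a dictionary of counters stands in for the hash table. It tabulates the distribution of the second item for each value of the first, then compares the conditional entropy with the unconditional one to show whether the first item is a potential entropy reducer.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Entropy in bits of the relative frequencies in a counter."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c > 0)


def pair_statistics(observations):
    """Tabulate, for each value alpha, the distribution of the paired item beta."""
    table = defaultdict(Counter)
    for alpha, beta in observations:
        table[alpha][beta] += 1
    return table


def conditional_entropy(table):
    """Average entropy of beta once alpha is known, weighted by frequency of alpha."""
    total = sum(sum(dist.values()) for dist in table.values())
    return sum(sum(dist.values()) / total * entropy(dist) for dist in table.values())


# Hypothetical observations: (grandfather of the "@" node, value of the "@" node).
observations = (
    [("call", "2")] * 3 + [("call", "3")]
    + [("arraydesc", "2")] * 2
    + [("body", "1"), ("body", "4")]
)
table = pair_statistics(observations)
unconditional = entropy(Counter(beta for _, beta in observations))
conditional = conditional_entropy(table)
print(f"H = {unconditional:.3f} without the grandfather, {conditional:.3f} with it")
```

A large gap between the two numbers suggests that a pattern one node larger, including the first item of the pair, is worth trying as a refinement.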

Pattern Refinement 

We will use the term pattern refinement to denote the process of "growing" 
patterns. A pattern $\pi$ with a large entropy contribution is explored using the techniques 
of the previous section. A set of larger patterns $\nu_1, \nu_2, \ldots, \nu_k$, containing $\pi$, is 
discovered which have node distributions of lower entropy than that of $\pi$. The 
refinement procedure removes the nodes matching the $\nu_i$ from those of $\pi$. This leaves a 
new pattern $\pi'$, defined as those nodes matching $\pi$, but not any $\nu_i$. We know by Theorem 
3.2 that this will lead to a lower entropy estimate. 

A program related to the matching routines of the previous section allows 
incremental updating of the entropy estimate. The entire set of trees is traversed. When 
a node matching $\pi$ is encountered, a test is made to see if it also matches some $\nu_i$. If so, the 
node is marked as matching $\nu_i$, and the triple composed of $\pi$, $\nu_i$, and the nodename is 
added to a hash table. In later sections of the thesis, we will refer to this triple, together 
with a count of occurrences, as a transaction. Another procedure processes the 
transactions, making changes to the node list of all relevant patterns. Figure 4.6 shows 
the transactions produced by the matching procedure and sample output from the update 
procedure when pattern 7, the length of lists, is refined. 



TRANSACTIONS: 

    old        new 
    pattern    pattern    nodename    count 
    7          218        2              80 
    7          219        2            1390 
    7          219        3             263 
    7          219        4             121 
    7          219        5              42 

BEFORE: 

(7) list[@] 
    f = 7425  H = 2.476, p = .025, H contr. = .062 
    38 cases (11 shown) 
    2     3599  .48      5     322  .04      10            61  .01 
    3     1139  .15      6     208  .03      …             60  .01 
    1     1038  .14      7     113  .02      9             54  .01 
    4      591  .08      8      80  .01      <<others>>   160  .02 

AFTER: 

(7) list[@] 
    f = 5529  H = 2.786, p = .019, H contr. = .052 
    SEE ALSO:      f       H       p     contr 
    (218)         80   0.000    .000    0.000    arraydesc[list[@]] 
    (219)       1816   1.085    .006     .007    call[*,list[@]] 
    38 cases (11 shown) 
    2     2129  .39      5     280  .05      10            61  .01 
    1     1038  .19      6     208  .04      …             60  .01 
    3      876  .16      7     113  .02      9             54  .01 
    4      470  .09      8      80  .01      <<others>>   160  .03 

(218) arraydesc[list[@]] 
    f = 80  H = 0.000, p = .000, H contr. = 0.000 
    2       80  1.00 

(219) call[*,list[@]] 
    f = 1816  H = 1.085, p = .006, H contr. = .007 
    4 cases 
    2     1390  .77      4     121  .07 
    3      263  .14      5      42  .02 

delta H = .003 

Figure 4.6. Example pattern refinement. 

The description of the results shown in Figure 4.6 is perhaps a little cryptic. In the 
"after" case, pattern 7 has been refined to mean all those lists that do not also satisfy 
patterns 218 and 219. Pattern 218 is particularly interesting. It says that a list under an 
arraydesc node has two sons (with probability 1), and no information need be encoded 
about the length of these lists. 
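
The update step can be imitated with the numbers of Figure 4.6. The Python sketch below is illustrative rather than a reproduction of the update program: the function names are assumptions, the node name that is illegible in the figure is labelled "?", and the "<<others>>" row lumps the 27 cases not shown, so the entropy of pattern 7 itself cannot be recomputed exactly from these rows.

```python
import math
from collections import Counter


def entropy(counts):
    """Entropy in bits of the relative frequencies in a counter."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c > 0)


def apply_transactions(patterns, transactions):
    """Move `count` occurrences of `nodename` from the old pattern to the new one."""
    for old, new, nodename, count in transactions:
        patterns[old][nodename] -= count
        patterns[new][nodename] += count


patterns = {
    7: Counter({"2": 3599, "3": 1139, "1": 1038, "4": 591, "5": 322, "6": 208,
                "7": 113, "8": 80, "10": 61, "?": 60, "9": 54, "<<others>>": 160}),
    218: Counter(),      # arraydesc[list[@]]
    219: Counter(),      # call[*,list[@]]
}
transactions = [
    (7, 218, "2", 80),
    (7, 219, "2", 1390),
    (7, 219, "3", 263),
    (7, 219, "4", 121),
    (7, 219, "5", 42),
]
apply_transactions(patterns, transactions)

for number in (7, 218, 219):
    f = sum(patterns[number].values())
    print(f"({number}) f = {f}")
print(f"H(219) = {entropy(patterns[219]):.3f}")
```

The frequencies after the update match the "after" portion of the figure, and the entropy of pattern 219 can be checked directly because all four of its cases are listed.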

Results of Pattern Refinement 

The procedure for lowering the entropy estimate by pattern refinement is quite 
easy to characterize. 

• The pattern matcher is used to find possible refinements of patterns with large 
entropy contributions. 

• The refinement procedure produces a list of transactions on the pattern set. 

• The entropy update procedure produces a new estimate for the affected patterns. 

• The process is repeated. 

Of course, as the Sorcerer's Apprentice learned, there needs to be some means for 
terminating the loop. The ridiculous extreme would be to find a huge set of patterns 
where nearly every pattern determines a unique member of the sample set. This 
approaches an encoding of programs by assigning them sequence numbers, an encoding 
that would be undefined on a program never seen before. The "rule of thumb" actually 
used was to consider a pattern $\pi$ only if the number of instances of $\pi$ was large 
compared to the range of values for the nodes associated with it. In Chapter 5, we will 
see a better stopping criterion in which the variability of the sample is gauged in order 
to decide on the advisability of a particular refinement. 

Such a process could be automated, but would require some care in choosing the 
predecessors of a pattern when looking for larger patterns. For the purposes of this 
thesis, the refinement was done with a "programmer in the loop." This has its 
advantages, since some patterns can be produced by knowledge of the compiler. 

The initial set of patterns contained 154 patterns, with the most "costly" pattern 
having an entropy estimate contribution of .499 bits. After applying the refinement 
procedure repeatedly, there were 268 patterns, with a highest contribution of .095 bits. 
The entropy estimate went from 2.058 to 1.642. This not only yielded a slightly lower 
value than produced by the uniform Markov approximations, it also reflects a 
conservative policy of only choosing refinements when the error introduced seems 
small. Section 5.8 contains an analysis of the potential inaccuracy of the estimate. 
Appendix C contains the actual data from the final set of patterns, including those 
additional modifications described in the next section. 

4.5. Program Transformations 

Sometimes a problem becomes much simpler to solve when approached from a 
different direction. Mathematicians use this concept when they apply a "change of 
variables" to a problem. The same is true of program encoding. For example, when 
Clark and Green were studying the encoding of list structures in LISP [Clark77], they 
found little regularity in the absolute value of CDR cells. When, however, the pointers 
were made relative to the address of the list cell, it became apparent that most CDR's 
point nearby in memory, and that many point to the adjacent cell. This relativizing of 
pointers is what can be considered an invertible transformation of the program. Such a 
transformation leads to a lower entropy estimate, since the model of the source 
corresponding to this new representation better reflects the dependencies in the data. 
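
The effect of relativizing pointers can be illustrated with a toy example. The Python sketch below is not taken from [Clark77]: the cell layout and the pointer values are hypothetical, chosen only to show that the transformation is invertible and that it lowers the zero-memory entropy when most pointers lead to the adjacent cell.

```python
import math
from collections import Counter


def entropy(values):
    """Zero-memory entropy in bits of a sequence of values."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def relativize(cdr):
    """cdr[i] is the absolute address stored in cell i; return offsets from i."""
    return [target - i for i, target in enumerate(cdr)]


def absolutize(offsets):
    """Inverse of relativize."""
    return [offset + i for i, offset in enumerate(offsets)]


cdr = [1, 2, 3, 7, 5, 6, 0, 8]          # hypothetical absolute pointers
offsets = relativize(cdr)

print(offsets)
print(f"absolute: {entropy(cdr):.3f} bits, relative: {entropy(offsets):.3f} bits")
```

The absolute addresses are all distinct, so each one looks equally surprising, while the offsets are dominated by a single value.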

With a slight modification to the entropy update procedure, we can provide 
machinery to allow rather general invertible transformations of the sample trees. Recall 
that a transaction is a triple of old pattern, new pattern, and node name, together with a 
count. If we simply allow either the old pattern or new pattern field to be zero, we have 
the capability of adding or deleting nodes in the set of trees and incrementally updating 
the entropy estimate. A concrete example will help us to see what else is needed to 
facilitate transformations. 

Transforming "bump" Trees 

Every programmer knows that programs contain many statements of the form 
a ← a+1. In fact, quite a few computers have instructions for incrementing a memory 
location. It is not uncommon for a compiler's code generator to generate such an 
instruction when applicable. We might therefore wish to encode the statement more 
compactly than the standard tree representation of assign[a,plus[a,num[1]]]. By 
suitable modifications to the matching procedure, we can find all such trees. Let us 
devise a scheme for transforming one into bump[a], and see what other tools are 
needed. One must be careful, of course, that a causes no side effects, such as a field 
pointed to by a procedure call. The "before" portion of Figure 4.7 shows the tree for 
such a statement. The numbers in square brackets are the patterns associated with the 
various nodes. 



Before 

[259] assign 
    [11] var 
        [167] local    [196] …    [3] …    [229] 16 
    [12] plus 
        [15] var 
            [1] local    [204] …    [3] …    [160] 16 
        [16] num 
            [191] 1 

After 

[259] bump 
    [269] var 
        [1] local    [156] …    [3] …    [160] 16 

Glossary of pattern numbers 

  1. var[@] 
  3. var[*,*,@] 
 11. assign[@] 
 12. assign[*,@] 
 15. plus[@] 
 16. plus[*,@] 
156. var[local,@] 
160. var[local,*,*,@] 
167. assign[var[@]] 
191. plus[*,num[@]] 
196. {assign|assignx}[var[local,@]] 
204. plus[var[local,@]] 
229. assign[var[local,*,*,@]] 
259. item[*,list[*,...@]] 
269. bump[@] 

Figure 4.7. Example of a program transformation. 



Ignore for the moment the pattern numbers given in brackets. In order to change 
the trees to reflect the new bump construct, we need only insert a single node, bump , in 
the tree, change the pointer that previously pointed to assign , and have the son of 
bump point to the old first son of assign. In addition to transforming the tree, we 
wish to update our pattern data base incrementally to reflect changes in the entropy 
estimate caused by such a transformation, and we wish to keep the pattern sequence 
numbers consistent with the patterns matched by the nodes. For the newly created bump 
node, the matching pattern is the same as that of the old assign node. The var node is 
assigned a new pattern number, that for son of bump . All the nodes of the subtree 
rooted by plus are discarded, so transactions are generated for them with a new pattern 
field of 0. The sons of the other var node are much more difficult to deal with because 
they match patterns involving the grandfather, which is assign in the original trees. 
Since we are taking away the assign node, we have to associate these sons with smaller 
patterns not involving the value of their grandfather. This is simplified by having each 
newly generated pattern remember which smaller pattern or patterns it refines. If we 
define the depth of a pattern to be the number of levels in its tree representation, we 
can describe a pattern downgrading procedure to deal with our problem. 

• Begin the procedure with an allowable pattern depth of 1. Begin at the sons of 
the subtree to be downgraded. 

• Until the pattern matching a node no longer exceeds the allowable depth, replace 
the pattern with the one from which it was refined. If there is more than one 
possibility, ask an "oracle" (in the current implementation, the programmer). 
Record a transaction of any pattern change for this node. 

• Recursively apply this procedure to the sons of the node with the allowable 
depth increased by one. 
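
The downgrading procedure can be sketched in Python as follows. This is a hedged illustration, not the implementation used for the thesis. The class and function names are hypothetical, depth is counted here as the number of levels above "@", the refinement links are assumed from the glossary of Figure 4.7, and the node names "addr" and "off" are placeholders for values that are illegible in the figure.

```python
from dataclasses import dataclass, field


@dataclass
class Pattern:
    number: int
    depth: int                           # assumption: levels above "@"
    refines: list = field(default_factory=list)   # smaller patterns it was grown from


@dataclass
class Node:
    name: str
    pattern: int
    sons: list = field(default_factory=list)


def downgrade(node, allowed, patterns, oracle, transactions):
    """Replace a too-deep pattern by the one it refines, then recurse on the sons."""
    old = node.pattern
    while patterns[node.pattern].depth > allowed:
        candidates = patterns[node.pattern].refines
        node.pattern = candidates[0] if len(candidates) == 1 else oracle(node, candidates)
    if node.pattern != old:
        transactions.append((old, node.pattern, node.name, 1))
    for son in node.sons:
        downgrade(son, allowed + 1, patterns, oracle, transactions)


def downgrade_subtree(root, patterns, oracle):
    """Start at the sons of `root` with an allowable depth of 1."""
    transactions = []
    for son in root.sons:
        downgrade(son, 1, patterns, oracle, transactions)
    return transactions


patterns = {
    1: Pattern(1, 1), 3: Pattern(3, 1), 156: Pattern(156, 1), 160: Pattern(160, 1),
    167: Pattern(167, 2, [1]), 196: Pattern(196, 2, [156]), 229: Pattern(229, 2, [160]),
}
var = Node("var", 269, [Node("local", 167), Node("addr", 196),
                        Node("off", 3), Node("16", 229)])
transactions = downgrade_subtree(var, patterns, oracle=lambda node, cands: cands[0])
print([son.pattern for son in var.sons])
print(transactions)
```

In this example every pattern refines exactly one smaller pattern, so the oracle is never consulted. The sons of var end with the pattern numbers shown in the "after" portion of Figure 4.7.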

In the sample set, there were 188 instances of the construct a ← a+1, of which 25 
were assignment expressions. In most cases, a was a simple variable, but some involved 
field selection operations and pointer arithmetic. The results of the Tenex-Mesa 
statistical study, Sec. 1.3, would predict a larger number of "bump" statements, but that 
study also counted the increment clauses of for statements. During the transformation 
phase, there were 1777 transactions generated, which involved creating 2 new patterns, 
bump and bumpx, and adjusting the node distribution for 62 previously existing 
patterns. All of these changes reduced the entropy estimate by .0012 bits per node, 
although the total number of nodes fell from 296895 to 295093, giving an effective 
decrease of .011. Even so, such numbers indicate that "bump" operations do not really 
consume very much of the total program information when viewed statically. 
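
The tree rewriting itself, leaving aside the pattern bookkeeping, can be sketched in Python. The sketch rests on several assumptions: the `Node` class and function names are hypothetical, operands are compared structurally, bumpx is taken to be the counterpart of bump for assignment expressions (assignx), and no check for side effects in a is attempted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    sons: list = field(default_factory=list)


ONE = Node("num", [Node("1")])


def is_bump(node):
    """True for {assign|assignx}[a, plus[a, num[1]]] with identical operands a."""
    if node.name not in ("assign", "assignx") or len(node.sons) != 2:
        return False
    target, rhs = node.sons
    if rhs.name != "plus" or len(rhs.sons) != 2:
        return False
    return rhs.sons[0] == target and rhs.sons[1] == ONE


def to_bump(node):
    """Rewrite every bump-shaped assignment in the tree."""
    if is_bump(node):
        return Node("bump" if node.name == "assign" else "bumpx", [node.sons[0]])
    return Node(node.name, [to_bump(son) for son in node.sons])


def from_bump(node):
    """Inverse transformation, showing that no information is lost."""
    sons = [from_bump(son) for son in node.sons]
    if node.name in ("bump", "bumpx"):
        original = "assign" if node.name == "bump" else "assignx"
        return Node(original, [sons[0], Node("plus", [sons[0], ONE])])
    return Node(node.name, sons)


a = Node("var", [Node("local"), Node("2"), Node("0"), Node("16")])   # hypothetical operand
program = Node("list", [Node("1"), Node("assign", [a, Node("plus", [a, ONE])])])
transformed = to_bump(program)
print(transformed.sons[1].name, len(transformed.sons[1].sons))
```

Because `from_bump` restores the original tree exactly, the transformation is invertible in the sense used above, which is what allows it to be treated as a change of representation rather than a loss of information.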






5. Error Analysis 

5.1. Need for Error Bounds 

One of our goals is to find a lower bound on the size of an object program. The 
smaller the determined entropy, the more compact our encoding can be made. We saw 
in Chapter 4 with both the uniform and nonuniform Markov models, that as we 
increased the order of the source, the amount of history used, the number obtained for 
the entropy became smaller. It should be clear, however, that the possible error in the 
estimate increases with the order of the model. It does us little good to have an entropy 
estimate of 1.0 if this number is known only to within ±10. 

We can view a Markov source as a process that can be in one of a number of 
states, depending upon the context, i.e., the values of previous outputs. The entropy of 
the source is defined in terms of the probabilities of the various states, and the 
conditional probabilities for given states. When we estimate the entropy of a source, 
there are two sources of error: we may overlook some of the states of the process, or we 
may incorrectly estimate one of the probabilities. We know from Theorem 3.2 that in 
the former case, our estimate will be an overestimate. Since such errors are inevitable, 
we can loosely rationalize them as bounding the compression capability of an encoder 
with a given level of complexity. The other source of errors, incorrect probabilities, is 
more serious. Statisticians have shown that for zero-memory sources, estimating the 
probabilities by the relative frequencies produces an underestimate of the true entropy 
[Basharin59]. A Markov source can be viewed as a collection of zero-memory sources, 
one for each state [Cover76], so it seems likely that experimental data will produce an 
underestimate in the Markov case as well. 

5.2. Notation 

Let us try to understand how the estimation of probabilities from the experimental 
data might cause problems. To facilitate our discussion, we will define some notation. 
It will be instructive to talk first about a zero-memory source S. Let the possible source 
outputs be 

$$s_1, s_2, \ldots, s_q$$

which occur with probabilities 

$$p_1, p_2, \ldots, p_q.$$

We don't know the values of the $p_i$, but we have empirical frequencies obtained from a 
"representative" sample. Denote them by 

$$n_1, n_2, \ldots, n_q$$

where $n_1 + n_2 + \cdots + n_q = N$. We choose to estimate $p_i$ by $\hat{p}_i = n_i/N$. Note that some of 
the $n_i$'s could be 0, so we will continue our convention that $0 \log 0 = 0$ in the sums 
below, while in actual practice, we simply wouldn't have a term for that $s_i$. 

5.3. Encoding Inefficiencies Induced by Improper Probability Estimates 

Entropy is defined as the average information content of a source output. It is also 
a lower bound for the average code length. We can achieve this bound if we can give 
an output $s_i$ a code of length $l(s_i) = -\log p_i$. While this is not always possible, assume 
for the moment that such a code can be found. We want to know how the average code 
length is affected when we devise a code, not using the $p_i$, but using our empirical $\hat{p}_i$. 
The true entropy of S is given by 

$$H(S) = -\sum_{i=1}^{q} p_i \log p_i ;$$

we have approximated this by 

$$\hat{H}(S) = -\sum_{i=1}^{q} \hat{p}_i \log \hat{p}_i .$$

The average code length, on the other hand, is defined by the sum 

$$\text{average code length} = -\sum_{i=1}^{q} p_i \log \hat{p}_i .$$

We know from Lemma 3.2 that the average code length will be larger than either $H$ or 
$\hat{H}$. The real difficulty occurs for those cases when $\hat{p}_i = 0$. This corresponds to possible 
outputs which did not occur at all in the sample. The quantity $\log 0$ is undefined, so if 
we had assigned codes on the basis of $\hat{p}_i$, we would not be able to encode $s_i$ at all. 
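
A small numerical example may make the penalty concrete. In the Python sketch below the true distribution and the two sets of estimated probabilities are hypothetical, and the function names are illustrative. An output that never occurred in the sample is given an infinite code length to model the fact that it cannot be encoded at all.

```python
import math


def entropy(p):
    """Entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def average_code_length(p, p_hat):
    """Average length when codes of length -log2(p_hat[i]) are used on a source with true p."""
    total = 0.0
    for true_prob, estimate in zip(p, p_hat):
        if true_prob == 0:
            continue                   # the output never occurs, so it costs nothing
        if estimate == 0:
            return math.inf            # no codeword was assigned to this output
        total -= true_prob * math.log2(estimate)
    return total


p = [0.5, 0.25, 0.125, 0.125]          # hypothetical true probabilities
close = [0.5, 0.3, 0.1, 0.1]           # estimates from a fairly good sample
missing = [0.6, 0.2, 0.2, 0.0]         # estimates from a sample lacking the last output

print(f"H(S)              = {entropy(p):.4f}")
print(f"length with close = {average_code_length(p, close):.4f}")
print(f"length with gap   = {average_code_length(p, missing)}")
```

The first mismatch costs only a small fraction of a bit per output, while the second makes the code unusable, which motivates the escape code introduced in the next section.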

5.4. Experimental Tests of Sample Variability 

One means of determining the quality of our probability estimates $\hat{p}_i$ is to use them 
to devise an encoding and then try the encoding out with new data, obtaining an 
empirical average cost. We will refer to the original data used to calculate $\hat{p}_i$ as the 
training sample. If both the training sample and our new test sample are representative, 
then the average cost should be close to H. For the purposes of estimating the error, it 
is not really necessary to produce the encoding; we can simply assume that we have 
devised a code with codeword lengths based on the probabilities. 

In order to deal with the case of outputs not in the training sample, we will add to 
our encoding an "escape code". Whenever we see the escape code, we know that the next 
$l$ bits denote the output in a less compact, but complete code. We will take a small 
quantity $\epsilon$ away from each of the probabilities $\hat{p}_i$ in order to create the escape code. 

Let $\tilde{p}_i = (1-\epsilon)\hat{p}_i$. Suppose that we can encode the test sample using a code with 
codeword lengths of $-\log \tilde{p}_i$, except for "new" outputs, which require a codeword of 
$-\log \epsilon$, followed by $l$ bits of a less compact code. Suppose that in the test sample, there 
are $h$ "new" outputs, and that $s_i$ occurs $m_i$ times. For those $s_i$'s whose frequencies 
contribute to $h$, let $m_i = 0$. Let us determine the total cost of encoding the test sample. 
Let $m_1 + m_2 + \cdots + m_q + h = M$. 

$$\begin{aligned}
\text{cost} &= -\sum_i m_i \log \tilde{p}_i - h \log \epsilon + hl \\
            &= -\sum_i m_i \log (1-\epsilon)\hat{p}_i - h \log \epsilon + hl \\
            &= -\sum_i m_i \log \hat{p}_i - (M-h) \log (1-\epsilon) - h \log \epsilon + hl
\end{aligned}$$

As $\epsilon \to 0$, the $-\log (1-\epsilon)$ term approaches 0, but the $-\log \epsilon$ term becomes large. 
Therefore, a careful choice of $\epsilon$ can reduce our encoding cost. The value of $\epsilon$ must be 
fixed before encoding a new sample, but we can make an a posteriori determination of 
the value of $\epsilon$ that would have minimized the cost. We can do this by computing the 
derivative 

$$\frac{\partial\,\text{cost}}{\partial \epsilon} = \frac{M-h}{1-\epsilon} - \frac{h}{\epsilon}$$

and setting the expression to 0. This yields a minimizing value of $\epsilon = h/M$, which can be 
given the unsurprising interpretation that the best choice of $\epsilon$ is the actual probability of 
a new output. 

Although $\epsilon$ must be fixed, there is no reason that we cannot use several training 
samples in order to determine an empirical probability of encountering outputs not in 
previous training samples. 
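
The cost formula and the minimizing value of the escape probability can be checked numerically. In the Python sketch below, the test sample counts, the estimated probabilities, and the length of the fallback code are all hypothetical, and the function and parameter names are illustrative.

```python
import math


def encoding_cost(m, p_hat, h, l, eps):
    """Total bits for the test sample: -sum m_i log2((1-eps) p_hat_i) - h log2(eps) + h l."""
    known = -sum(mi * math.log2((1 - eps) * pi) for mi, pi in zip(m, p_hat) if mi > 0)
    return known + h * (l - math.log2(eps))


m = [48, 27, 20]              # occurrences in the test sample of outputs seen in training
p_hat = [0.5, 0.3, 0.2]       # probabilities estimated from the training sample
h = 5                         # "new" outputs in the test sample
l = 10                        # bits of the less compact code following an escape
M = sum(m) + h

best = h / M                  # minimizing value derived above
for eps in (0.01, 0.02, best, 0.10, 0.20):
    print(f"eps = {eps:.2f}  cost = {encoding_cost(m, p_hat, h, l, eps):.2f} bits")
```

Among the values tried, the cost is smallest at h/M, in agreement with the derivative. The fallback length l shifts every cost by the same amount and so does not affect the choice.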

5.5 Applying Experimental Tests to the Markov Models 

The problem of finding an optimal encoding scheme for a Markov source is 
considerably more difficult than finding one for a zero-memory source. Pasco 
[Pasco76] has a good discussion of the published literature in the field. Our principal 
concern, however, is to gauge the variability in our sample set. For this, it is not 
actually necessary to produce an encoding, but rather to use the empirical probabilities 
to hypothesize code lengths and see if the distribution of nodes in the new sample is 
well matched to these lengths. 

A Markov Source Example 

The techniques described above for zero-memory sources can be readily extended 
to Markov sources. To gain insight into the extension, it is instructive to look first at 
some actual experimental data for a simpler source. Consider again the Markov source 
defined by the state diagram in Figure 3.3. Assign probabilities to the transitions as 
follows: 

    state    output    probability 
    a        a         .3 
    a        b         .7 
    ab       a         .4 
    ab       b         .6 
    bb       a         .5 
    bb       b         .5 

Figure 5.1. Conditional probabilities for the example. 

If we assume that the source is stationary, we can calculate the steady state 
probabilities of the various states by a technique beyond the scope of this discussion (see 
[Ash65]). They are given in Figure 5.2. 



    state    probability 
    a        .3937 
    ab       .2756 
    bb       .3307 

Figure 5.2. Steady state probabilities for the example. 
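
The steady state probabilities, and the entropy that follows from them, can be checked with a short Python sketch. The transition rule in `next_state` is an assumed reading of the state diagram of Figure 3.3, in which the state records the relevant suffix of the output string, and simple repeated iteration is used here in place of the technique referenced in [Ash65].

```python
import math

# Conditional probabilities from Figure 5.1: P[state][output].
P = {
    "a":  {"a": 0.3, "b": 0.7},
    "ab": {"a": 0.4, "b": 0.6},
    "bb": {"a": 0.5, "b": 0.5},
}


def next_state(state, output):
    """Assumed transition rule: an 'a' leads to state a, a 'b' extends the suffix."""
    if output == "a":
        return "a"
    return "ab" if state == "a" else "bb"


def steady_state(P, iterations=200):
    """Iterate the state distribution until it settles."""
    dist = {state: 1 / len(P) for state in P}
    for _ in range(iterations):
        updated = dict.fromkeys(P, 0.0)
        for state, mass in dist.items():
            for output, prob in P[state].items():
                updated[next_state(state, output)] += mass * prob
        dist = updated
    return dist


def state_entropy(outputs):
    """Entropy in bits of the output distribution of one state."""
    return -sum(prob * math.log2(prob) for prob in outputs.values() if prob > 0)


dist = steady_state(P)
H = sum(dist[state] * state_entropy(P[state]) for state in P)
for state in P:
    print(f"{state:3} {dist[state]:.4f}")
print(f"entropy = {H:.4f} bits per symbol")
```

Under the assumed transition rule the iteration settles on the probabilities of Figure 5.2, and the weighted sum of the per-state entropies is about .945 bits per symbol.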

Using the values from Figures 5.1 and 5.2, we can compute the entropy of the 
source to be .9452 bits per symbol. Figure 5.3 shows results from a program that 
randomly generated a string of a's and b's based upon the probabilities from Figure 5.1. 
It is comforting to see that the empirical steady state probabilities are close to the 
theoretical ones. 
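
The steady state and entropy figures can be checked with a few lines of Python. This is only a sketch: the successor function `next_state` is our assumption about the state diagram of Figure 3.3 (the state records whether the string currently ends in a, ab, or bb), and the repeated propagation used here stands in for the analytic technique of [Ash65].

```python
import math

# Transition table from Figure 5.1: state -> {output: probability}.
P = {
    "a":  {"a": 0.3, "b": 0.7},
    "ab": {"a": 0.4, "b": 0.6},
    "bb": {"a": 0.5, "b": 0.5},
}

def next_state(state, output):
    """Successor state, assuming the state records a trailing a, ab, or bb."""
    if output == "a":
        return "a"
    return "ab" if state == "a" else "bb"

def steady_state(P, iterations=200):
    """Approximate the stationary distribution by repeated propagation."""
    dist = {s: 1.0 / len(P) for s in P}
    for _ in range(iterations):
        new = {s: 0.0 for s in P}
        for s, outputs in P.items():
            for out, p in outputs.items():
                new[next_state(s, out)] += dist[s] * p
        dist = new
    return dist

def entropy(P, dist):
    """H(S) = sum over states of P(state) * H(S | state), in bits."""
    return sum(
        dist[s] * -sum(p * math.log2(p) for p in outputs.values() if p > 0)
        for s, outputs in P.items()
    )

pi = steady_state(P)
print({s: round(v, 4) for s, v in pi.items()})
print(entropy(P, pi))
```

Under these assumptions the printed distribution should agree with Figure 5.2, and the entropy should agree with the quoted .9452 to about three decimal places.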

state    output    count    state total    P(state)    P(output | state)

a        a           54                                  .28
a        b          142         196          .39         .72
ab       a           60                                  .42
ab       b           84         144          .29         .58
bb       a           83                                  .52
bb       b           77         160          .32         .48

[Counts for the five individual pieces of the string are not reproduced here; only the totals are shown.]

Figure 5.3. Randomly generated data.

A Computationally Simpler Formula for Entropy

The formulas for entropy are defined in terms of probabilities and their 
logarithms. Since we are using relative frequencies for approximations of the 
probabilities, perhaps there will be some cancellation of terms in the formula when we 
deal with the frequencies not as decimal numbers but as fractions. 

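\[
\begin{aligned}
H(S) &= P_a H(S \mid a) + P_{ab} H(S \mid ab) + P_{bb} H(S \mid bb) \\
     &= \frac{196}{500}\left(-\frac{54}{196}\log\frac{54}{196} - \frac{142}{196}\log\frac{142}{196}\right) \\
     &\quad + \frac{144}{500}\left(-\frac{60}{144}\log\frac{60}{144} - \frac{84}{144}\log\frac{84}{144}\right) \\
     &\quad + \frac{160}{500}\left(-\frac{83}{160}\log\frac{83}{160} - \frac{77}{160}\log\frac{77}{160}\right) \\
     &= \frac{1}{500}\bigl(196\log 196 + 144\log 144 + 160\log 160 \\
     &\qquad - 54\log 54 - 142\log 142 - 60\log 60 \\
     &\qquad - 84\log 84 - 83\log 83 - 77\log 77\bigr) \\
     &= .935 \text{ bits per symbol.}
\end{aligned}
\]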
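
The cancellation can be confirmed numerically. The sketch below evaluates both the customary form and the reduced form on the counts of the example; the function names are ours and the logarithms are taken base 2.

```python
import math

# counts[state][output] for the 500-symbol sample of the example.
counts = {
    "a":  {"a": 54, "b": 142},
    "ab": {"a": 60, "b": 84},
    "bb": {"a": 83, "b": 77},
}

def entropy_from_probabilities(counts):
    """Customary form: weight each state's conditional entropy by its frequency."""
    total = sum(sum(c.values()) for c in counts.values())
    h = 0.0
    for c in counts.values():
        n_state = sum(c.values())
        h_state = -sum((n / n_state) * math.log2(n / n_state) for n in c.values() if n)
        h += (n_state / total) * h_state
    return h

def entropy_from_counts(counts):
    """Reduced form: (sum n_state log n_state - sum n log n) / N."""
    total = sum(sum(c.values()) for c in counts.values())
    state_term = sum(sum(c.values()) * math.log2(sum(c.values())) for c in counts.values())
    node_term = sum(n * math.log2(n) for c in counts.values() for n in c.values() if n)
    return (state_term - node_term) / total

print(entropy_from_probabilities(counts))
print(entropy_from_counts(counts))
```

Both functions should return the same value up to floating point roundoff, close to the .935 bits per symbol obtained above.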

This example shows that the reduced formula is indeed much easier to compute,
with only one division required at the end, rather than requiring all of the probabilities
to be computed. It is easy to derive the corresponding formula for arbitrary sources; we
will merely state the result. Let $n_\pi$ denote the number of occurrences of pattern $\pi$ in
the sample. Let $n_{\pi i}$ denote the number of occurrences of the node $s_i$ in a situation
matching pattern $\pi$. If there are a total of $N$ nodes in the sample, then

\[
H(S) = \frac{1}{N}\Bigl(\sum_{\pi} n_\pi \log n_\pi - \sum_{\pi,i} n_{\pi i} \log n_{\pi i}\Bigr).
\]


With the customary formula, it is necessary to know the frequency of occurrence of
each pattern ($n_\pi$) in order to estimate the probabilities $P_{i|\pi}$. This implies that if we
have only sequential access to the data, it is necessary to make two passes over the data
in order to calculate the entropy. The formula above does not have this restriction; we
can use the frequency of each node in a pattern and accumulate pattern totals. One
should ask, however, what roundoff error is introduced by the subtraction of the two
potentially large sums. There does not seem to be a real problem; when the two
formulas were used to calculate the entropy of the program tree sample (300,000 nodes),
the two estimates differed by less than one part per million.
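
A minimal sketch of the single-pass accumulation follows. It reads a stream of (pattern, node) pairs once, keeps only the two sets of counts, and applies the reduced formula at the end. The generator `markov_pairs` is a hypothetical stand-in for the data; it draws from the example source of Figure 5.1, and its state update is our assumption about Figure 3.3.

```python
import math
import random
from collections import Counter

def one_pass_entropy(pairs):
    """Estimate H from a stream of (pattern, node) pairs that is read once."""
    pattern_counts = Counter()   # occurrences of each pattern
    node_counts = Counter()      # occurrences of each node within a pattern
    total = 0                    # number of nodes seen
    for pattern, node in pairs:
        pattern_counts[pattern] += 1
        node_counts[(pattern, node)] += 1
        total += 1
    pattern_term = sum(n * math.log2(n) for n in pattern_counts.values())
    node_term = sum(n * math.log2(n) for n in node_counts.values())
    return (pattern_term - node_term) / total

def markov_pairs(length, seed=0):
    """Yield (state, output) pairs from the example source of Figure 5.1."""
    rng = random.Random(seed)
    p_a = {"a": 0.3, "ab": 0.4, "bb": 0.5}   # probability of output a in each state
    state = "a"
    for _ in range(length):
        out = "a" if rng.random() < p_a[state] else "b"
        yield state, out
        state = "a" if out == "a" else ("ab" if state == "a" else "bb")

print(one_pass_entropy(markov_pairs(100_000)))
```

For a long stream the estimate should settle near the theoretical entropy of the source.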

5.6. Computing Average Code Length 

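Suppose now that we have used a training sample to estimate the probabilities that
characterize our source. We now wish to encode a new sample of $N$ nodes using code
lengths determined by these probabilities. Let $n'_\pi$ and $n'_{\pi i}$ denote the pattern and node
frequencies for the training sample, and let $n_\pi$ and $n_{\pi i}$ denote those for the new
sample. Consider the formula for average code length, ignoring for the moment the
possibility of nodes in the sample not present in the training sample:

\[
\begin{aligned}
\text{average code length}
  &= \frac{1}{N}\sum_{\pi,i} n_{\pi i}\,\bigl(-\log P'_{i|\pi}\bigr) \\
  &= \frac{1}{N}\sum_{\pi,i} n_{\pi i}\,\bigl(-\log (n'_{\pi i}/n'_\pi)\bigr) \\
  &= \frac{1}{N}\Bigl(\sum_{\pi} n_\pi \log n'_\pi - \sum_{\pi,i} n_{\pi i} \log n'_{\pi i}\Bigr)
\end{aligned}
\]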

One should note that if the training sample and the test sample are the same, the
average code length formula reduces to the formula for H, the estimated entropy. We
can apply this formula to the data of Figure 5.3 using 80% of the data as a training
sample and the remaining 20% as a test sample; Figure 5.4 shows the relevant counts.

                     test sample (20%)        training sample (80%)
state    output    count    state total     count    state total

a        a           10                       44
a        b           30         40           112         156
ab       a           13                       47
ab       b           18         31            66         113
bb       a           18                       65
bb       b           11         29            66         131

Figure 5.4. Node counts considering first 80% of data as a training sample.

We can readily calculate the average code length from the formula:

\[
\begin{aligned}
\text{average code length}
  &= \frac{1}{100}\bigl(40\log 156 + 31\log 113 + 29\log 131 \\
  &\qquad - 10\log 44 - 30\log 112 - 13\log 47 \\
  &\qquad - 18\log 66 - 18\log 65 - 11\log 66\bigr) \\
  &= .921
\end{aligned}
\]
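
The same calculation can be written as a short Python sketch. The counts are read off the terms of the formula above (test counts multiply the logarithms of training counts); the function name and the dictionary layout are ours.

```python
import math

# Training counts (first 80% of the data) and test counts (remaining 20%).
train = {"a": {"a": 44, "b": 112}, "ab": {"a": 47, "b": 66}, "bb": {"a": 65, "b": 66}}
test = {"a": {"a": 10, "b": 30}, "ab": {"a": 13, "b": 18}, "bb": {"a": 18, "b": 11}}

def average_code_length(train, test):
    """Mean bits per symbol when each test symbol costs -log2(training frequency)."""
    total_bits = 0.0
    total_symbols = 0
    for pattern, test_nodes in test.items():
        n_train_pattern = sum(train[pattern].values())
        for node, n_test in test_nodes.items():
            # Raises if the node never occurred in training; that case is not handled here.
            p = train[pattern][node] / n_train_pattern
            total_bits += n_test * -math.log2(p)
            total_symbols += n_test
    return total_bits / total_symbols

print(average_code_length(train, test))
```

With these counts the result should agree with the .921 computed by hand.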



If we repeat this process for each of the 5 samples, we get the average code lengths
shown in Figure 5.5.

missing                                                     standard
sample      1       2       3       4       5      mean     deviation

length     .963    .918    1.037   .896    .921    .947     .056

Figure 5.5. Average code length for each sample in terms of the others.

Allowing for Outputs not in the Training Sample 

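As we have seen before, we must allow for nodes in the sample that occur in
situations where they did not occur in the training sample. We do this by "stealing" a
small probability $\epsilon_\pi$ from each of the probabilities $P_{i|\pi}$. When computing the cost of
the "surprise" nodes, we add a length of $-\log \epsilon_\pi$ as an "escape code" plus a length of $l_\pi$
for a naive encoding of the node. The average code length is a little more complicated:

\[
\text{average length} = \frac{1}{N}\Bigl(\sum_{\pi,i} n_{\pi i}\,\bigl(-\log\bigl((1-\epsilon_\pi)\,n'_{\pi i}/n'_\pi\bigr)\bigr)
  + \sum_{\pi} n_{\pi 0}\,\bigl(-\log \epsilon_\pi + l_\pi\bigr)\Bigr),
\]

where primed counts are those of the training sample, the first sum runs over the nodes
that did occur in the training sample, and $n_{\pi 0}$ is the number of surprise nodes matching
pattern $\pi$. Writing $n_\pi = \sum_i n_{\pi i}$ for the number of nodes matching $\pi$ that are not
surprises allows us to restate the cost as the following visually complicated but
computationally simple formula:

\[
\begin{aligned}
\text{average length} = \frac{1}{N}\Bigl(&\sum_{\pi} n_\pi \log n'_\pi - \sum_{\pi,i} n_{\pi i} \log n'_{\pi i}
  - \sum_{\pi} n_\pi \log(1-\epsilon_\pi) \\
  &- \sum_{\pi} n_{\pi 0} \log \epsilon_\pi + \sum_{\pi} n_{\pi 0}\, l_\pi\Bigr).
\end{aligned}
\]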
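
The escape-code cost can be sketched in Python as follows. The function and parameter names are hypothetical: `epsilon` holds the probability stolen for each pattern and `naive_bits` holds the assumed length of a naive encoding of a node in that pattern.

```python
import math

def average_length_with_escape(train, test, epsilon, naive_bits):
    """Average bits per node when unseen nodes are sent as escape code + naive code.

    train, test : dict pattern -> dict node -> count
    epsilon     : dict pattern -> probability reserved for unseen nodes
    naive_bits  : dict pattern -> length in bits of a naive encoding of a node
    """
    bits = 0.0
    nodes = 0
    for pattern, test_nodes in test.items():
        seen = train.get(pattern, {})
        n_train = sum(seen.values())
        eps = epsilon[pattern]
        for node, n_test in test_nodes.items():
            if seen.get(node, 0) > 0:
                # Seen node: training frequency scaled down by (1 - eps).
                p = (1.0 - eps) * seen[node] / n_train
                bits += n_test * -math.log2(p)
            else:
                # Surprise node: escape code plus a naive encoding.
                bits += n_test * (-math.log2(eps) + naive_bits[pattern])
            nodes += n_test
    return bits / nodes

example = average_length_with_escape(
    train={"p": {"x": 3, "y": 1}},
    test={"p": {"x": 2, "z": 1}},
    epsilon={"p": 0.25},
    naive_bits={"p": 3},
)
print(example)
```

The sketch fails with a math domain error if a surprise node is met in a pattern whose reserved probability is zero, which is why that probability must be positive wherever unseen nodes are possible.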



5.7. A Criterion for Deciding Whether to Use a Pattern Refinement 

Adding a new pattern to the source model defines a zero-memory source: the nodes 
appearing in situations matching the pattern. We can use the techniques of the 
preceding section to estimate the variability of the sample set for the new pattern. 

Suppose that our data contains $q$ distinct nodes, each occurring $n_i$ times for
$1 \le i \le q$. Let $\sum_i n_i = N$. We can calculate an estimate for the entropy of this
distribution using the formula:

\[
H(S) = \frac{1}{N}\Bigl(N \log N - \sum_{i} n_i \log n_i\Bigr)
\]

When we compute the entropy of $S$ from the finite sample, we make the assumption
that $n_i/N$ is a valid approximation to the probability that node $s_i$ will appear in a
similar situation in future programs. If, however, the "density" of $s_i$ nodes is highly
non-uniform across the sample, this is not a very good assumption. Let us now consider
how to know if the assumption is worthwhile. Let the sample be divided into $r$ roughly
equal pieces, with total sizes of $N^{(k)}$, $1 \le k \le r$. Denote the count of $s_i$ nodes in the
$k$th piece by $n_i^{(k)}$. Using the formulas of Section 5.6, we can obtain $r$ values for the
average code length, using $r-1$ pieces of the large sample as the training sample as follows:

Total training sample size:        $N - N^{(k)}$,
Training count for node $s_i$:     $n_i - n_i^{(k)}$,
Total test sample size:            $N^{(k)}$,
Test count for node $s_i$:         $n_i^{(k)}$.

The only remaining quantity in the average code length formulas is $\epsilon$, the "fudge factor"
for nodes not in the training sample. For the purposes of deciding upon a refinement,
we will simply use $\epsilon$ equal to the empirical probability of an unseen node (shown to be
the best possible $\epsilon$ in Section 5.4).
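
The procedure for a single pattern might be sketched as below. Several details are our assumptions rather than the thesis's: each piece is given as a dictionary of node counts, the fudge factor for a piece is taken to be the fraction of its own nodes that are absent from the other pieces, `naive_bits` is a hypothetical cost for the naive encoding of an unseen node, and the summary uses the unweighted mean and sample standard deviation.

```python
import math
import statistics

def leave_one_out_lengths(pieces, naive_bits):
    """Average code length of each piece, coded from the counts of the others.

    pieces     : list of dicts node -> count, one per (non-empty) piece
    naive_bits : assumed cost in bits of a naive encoding of an unseen node
    """
    totals = {}
    for piece in pieces:
        for node, n in piece.items():
            totals[node] = totals.get(node, 0) + n
    grand_total = sum(totals.values())

    lengths = []
    for piece in pieces:
        test_size = sum(piece.values())
        train_size = grand_total - test_size
        train = {node: totals[node] - piece.get(node, 0) for node in totals}
        unseen = sum(n for node, n in piece.items() if train[node] == 0)
        eps = unseen / test_size          # empirical probability of an unseen node
        bits = 0.0
        for node, n in piece.items():
            if n == 0:
                continue
            if train[node] > 0:
                bits += n * -math.log2((1.0 - eps) * train[node] / train_size)
            else:
                bits += n * (-math.log2(eps) + naive_bits)
        lengths.append(bits / test_size)
    return lengths, statistics.mean(lengths), statistics.stdev(lengths)
```

A pattern whose pieces give widely scattered lengths, as measured by the returned standard deviation, would be a poor candidate for refinement under the criterion of this section.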

It is instructive to look at the two patterns in Figure 5.6, taken from Appendix C.
They each predict the frame of a variable (local, global, field, or entry; see
Chapter 2). In the sample, only local and global variables occurred in these
constructs. Pattern 180 predicts the frame of a variable on the right side of a relational
operator. Pattern 185 predicts the frame of a variable that is a "subscript", either of an
array or a string.

(180) {relE|relG|relN|relL|relGE|relLE}[*,var[@]]
      f = 403  H = .998, p = .001, H contr. = .001
      global 212 .53    local 191 .47

(185) {index|dindex|seqindex}[*,var[@]]
      f = 289  H = .615, p = .001, H contr. = .001
      local 245 .85     global 44 .15

Figure 5.6. Two patterns from the final pattern set. 

Figure 5.7 shows the results of encoding approximately 20% of the sample using
code lengths determined by the statistics of the remaining 80%. If the distribution of
nodes is reasonably uniform across the sample, the average code length should be close
for each of the 5 samples. For Pattern 180, this appears to be true; we can predict with
some confidence the probability distribution for nodes in situations matching the
pattern. With Pattern 185, however, the situation is much worse; we would need a much
larger sample to decide the actual probabilities. Thus, if we were to automate the
selection of refinement patterns, we would rule out pattern 185 as a viable candidate.

[Table largely lost in extraction; surviving entries: .373  .309  .255  1.498 ≈ 139%]

Figure 5.7. Average code length using codes based on remaining samples.

This procedure was devised after the results shown in Appendix C were obtained. 
There were a few obviously bad patterns like 185 above, but the typical pattern produced 
a standard deviation of 15% for the 5 sample pieces. 

Once we have a criterion for deciding on the quality of the sample for a given 
pattern, we can use it to decide whether a particular pattern refinement is warranted. 
Figure 4.6 showed a pattern refinement where two new patterns were added. Consider 
now only one of them, pattern 219. Pattern 7 predicts the length of a list in the parse 
tree; pattern 219 predicts the length of a list which is a parameter list of a procedure 
call. Figure 5.8 shows the average code length calculation for pattern 7 before 
refinement. 

Figure 5.8. Variability of pattern 7 before refinement.

Recall that we are interested in lowering our entropy estimate. When we refine the
pattern, we obtain two new patterns, and replace the contribution to the entropy of the
original pattern by a linear combination of the entropy estimates of the two new
patterns. Although the entropy estimate for the combined new patterns will never be
higher than that of the original, the variability of the sample may well place us in a
situation where the mean plus one standard deviation is higher after the refinement than
before. This is a usable criterion for deciding on a refinement. Figure 5.9 shows the
two patterns after refinement.

Figure 5.9. Variability of pattern 7 and pattern 219 after refinement.

If we use the mean code length figure for our estimate, the original estimate of
2.508 is replaced by the linear combination

\[
\frac{5609}{7425}\,(2.789) + \frac{1816}{7425}\,(1.107) = 2.377.
\]

The linear combination figure for the 5 samples has a standard deviation of .163. We
note that 2.377+.163 is less than 2.508+.142, so the refinement is warranted.
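
The decision just made can be written as a small Python sketch. The two function names are ours; the numbers are those quoted above for patterns 7 and 219.

```python
def combined_mean(parts):
    """Count-weighted mean code length of the patterns produced by a refinement.

    parts : iterable of (node count, mean code length) pairs
    """
    parts = list(parts)
    total = sum(count for count, _ in parts)
    return sum(count * mean for count, mean in parts) / total

def refinement_warranted(mean_before, sd_before, mean_after, sd_after):
    """Accept the refinement when mean + one standard deviation goes down."""
    return mean_after + sd_after < mean_before + sd_before

after = combined_mean([(5609, 2.789), (1816, 1.107)])
print(after)
print(refinement_warranted(2.508, 0.142, after, 0.163))
```

The weighted mean should come out within rounding of the 2.377 given in the text, and the test should report that the refinement is accepted.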

5.8. Evaluating the Entropy Estimates for the Program Sample 

In this section, we will reexamine the entropy estimate obtained in Chapter 4 for 
the final pattern set. Certain simplifying assumptions used in the estimates are removed 
in order to make a more realistic estimate. 

String Literals 

As we saw in Section 4.2, the string literals that occurred in the test sample were
not suited to the same sort of encoding as were the other tree nodes. In the uniform
Markov studies, strings were given no special treatment since they do not grossly affect
the results. When we calculate the average code length for the program sample, we are
in a position to treat the string literal nodes as a special case and use code lengths for
them that more accurately reflect the actual number of bits required in an object
program.

Quite a bit of work has been done on the problem of "compressing" character
strings. Hehner discusses several possible schemes in his thesis [HehnerVl]. Some of
the more interesting methods for string compression are adaptive, using early parts of
the string to encode later ones. The analysis programs for this thesis use a very
conservative approach: the string length is given, followed by the characters of the
string. In order to have the average code length for strings closer to the entropy, the
individual characters are Huffman coded, requiring on the average 5.06 bits per
character as opposed to the 7 bit ASCII codes used by the compiler.

Average Code Length for the Program Sample 

It is straightforward to modify the average code length formulas of Section 5.6 to
consider string literals as a special case. Recall that the entropy estimate for the sample
was 1.642 bits per node. This corresponds to an average code length using the entire
sample as a training sample, and setting $\epsilon_\pi = 0$ for all $\pi$. When we simply add a more
realistic cost for string literals to the computation, the entropy value goes to 1.702 bits
per node, the 4% increase mentioned in Section 4.2.

More interesting numbers are obtained from the average code lengths for encoding
part of the sample using code lengths based on the remaining sample. In this case, we
have to be more conservative in our choice of $\epsilon_\pi$ than in the procedure for deciding
upon a refinement. The procedure is as follows:

• The encoding process is run on some portion of the samples, with a count of
missing nodes maintained for each pattern.

• For each pattern, a value of $\epsilon_\pi$ is chosen equal to the empirical probability of a
missing node for that pattern.

• For patterns with a zero empirical probability, $\epsilon_\pi$ is set to a small positive
constant (.001 in this case) unless it is known that indeed there can be no
missing nodes. An example of the latter is the var node which is constructed
with one of four possibilities as its first son. In these cases, a value of $\epsilon_\pi = 0$ is
warranted. If an empirical probability of 1 is obtained, some smaller value (.5 in
this case) is used for $\epsilon_\pi$.
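
The three rules above can be collected into one small function. This is only a sketch: the name `choose_epsilon` and its parameters are hypothetical, and the flag `closed` stands for the knowledge that a pattern can have no missing nodes.

```python
def choose_epsilon(missing, total, closed=False, floor=0.001, ceiling=0.5):
    """Pick the escape probability for one pattern from a trial encoding run.

    missing : test nodes matching the pattern that were absent from training
    total   : all test nodes matching the pattern
    closed  : True when the pattern is known to admit no unseen nodes
    """
    if closed:
        return 0.0
    empirical = missing / total if total else 0.0
    if empirical == 0.0:
        return floor      # never saw a miss, but one remains possible
    if empirical == 1.0:
        return ceiling    # every node was a miss; back off to a smaller value
    return empirical

print(choose_epsilon(0, 100))               # open pattern with no misses
print(choose_epsilon(0, 100, closed=True))  # e.g. the first son of a var node
print(choose_epsilon(5, 5))                 # every test node was unseen
print(choose_epsilon(3, 12))                # ordinary case
```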

Figure 5.10 shows the results of encoding 20% of the sample using code lengths
based on the remaining 80%. The values of $\epsilon_\pi$ were chosen by considering the first
three samples as described above.

                         Test Sample                                     standard
             1        2        3        4        5        mean          deviation

count      62322    59001    58638    56843    58289
length     1.940    1.978    1.726    1.729    1.799      1.837         .230 = 13%

Figure 5.10. Average code length and variability for program sample.

The mean code length from Figure 5.10 of 1.84 bits per node is a reasonable
estimate of the entropy of the source model defined by the final pattern set. The
standard deviation figure gives a measure of confidence in the entropy estimate.

We must be careful when choosing a division of our sample into pieces for the 
above procedure. If there are few large pieces, the training sample is correspondingly 
reduced, yielding poor estimates of the probabilities. If there are many small samples, 
the nodes of the test sample are less likely to have a typical distribution. Figure 5.11 
shows the procedure applied to the same sample broken into 10 pieces instead of 5. 

                                       Test Sample
            1       2       3       4       5       6       7       8       9       10

count     23664   21407   20056   28655   24096   29634   34974   32747   42266   37594
ave
length    [illegible]  2.060   1.941   1.969   2.002   1.653   1.630   1.534   1.937   1.943

mean = 1.840    standard deviation = .376 ≈ 20%

Figure 5.11. Average code length procedure applied to 10 sample pieces.

The mean code length was almost the same for both applications of the procedure,
indicating that the probability estimates from 80% of the samples were close to those
from 90%. The smaller test sample size in Figure 5.11 results in a larger variability of
the average code length, but within the factor of $\sqrt{2}$ expected by doubling the number
of samples; the samples are still large enough to have typical distributions of nodes.

5.9. Conclusions 

The error estimating procedures produced for this thesis give a reasonably good
picture of the variability of the empirical sample. They do not, however, give
mathematically rigorous estimates of the error bounds. This is an area ripe for further
study, presumably by someone with a strong grounding in theoretical statistics.

6. Conclusions

6.1. Applications 

The entropy of a program source defines a lower bound for the size of object 
programs. The most obvious application of this is the evaluation of existing compilers. 
As we have seen, to the extent that additional program dependencies exist, we can lower 
the average code length for programs by using more complicated encoding schemes. 
More complicated encoding schemes, however, entail more complicated decoding 
schemes or interpreters in the case of object programs. A designer can look at the 
performance of his current encoding and compare it to the theoretical minimum. If the 
current performance is "close enough", then the potential increase in size of the 
interpreter may not warrant further work on encoding. 

One assumption of the analysis routines is that the program sample is
representative of programs that will be written in the future. Just as a present day
programmer might choose ALGOL for a numerical analysis task, Lisp for a list processing
task, and Cobol for a business application, it makes sense for a given language to have a
compiler that can generate several different object codes, each tailored to a type of
program. For example, input-output device driver programs and compilers may not
have the same instruction mix. One could easily have several data bases of programs on
which to run the analysis routines, each containing programs of a given class.

While the patterns used to estimate the entropy do in fact define an encoding for
the language, it is not one that is easy to implement. The analysis routines build the set
of patterns incrementally and never have to recognize patterns in actual trees after the
initial pattern set is created. If one were trying to "automate" the design of an object
code, it would probably be worthwhile to limit the number of possible patterns (states of
the encoder). One could intuitively think of states such as statement, expression, string
constant, etc. The methodology of this thesis would allow evaluation of the merit of
such a constrained set of patterns.

Another related application is the ability to answer questions about programming 
style. Appendix C.4 - C.6 is full of interesting information. For example, in Pattern 9, 
we see what sort of parameters are used in procedure calls. These statistics, together with 
patterns 219 (lengths of parameter lists), and 246 (elements of parameter lists) tell us 
quite a lot about how we should design procedure linkage. Figure 6.1 shows the 
percentage of all procedure calls in the sample having a given number of parameters. 

Figure 6.1. Number of parameters in procedure calls (static count).

These numbers point out an interesting phenomenon. The statistics taken on
Tenex-Mesa programs, discussed in Section 1.3, that were used to design the encoding
of Mesa programs indicated that most procedure calls had fewer than 6 parameters.
While Mesa allows an arbitrary number of parameters, it is more efficient if there are 5
or fewer. The programmers in the test sample were implementers of the language who
knew this; they didn't use more than 5 parameters. This is a feedback situation that one
must be careful about when defining new features.

Another use for these program statistics is in language design. Suppose that a 
designer is adding a feature to the language to simplify a common construct that 
previously took many symbols to specify. Knowledge of how the construct is used in a 
large collection of programs can guide the designer in deciding what should be the 
default values of various parameters to the new feature. 

6.2. Extensions and Directions for Further Research 

All of the counts of various program constructs studied in this thesis are static
counts. That is, they reflect how often something appears in the program text rather
than how often it occurs in the course of program execution. When Knuth made his
study of Fortran programs [Knuth71], he found that less than 4% of a typical
program accounted for more than half of its running time. Thus the distribution of use
of various statement types was different for the dynamic counts, those weighted by the
number of times that they are executed. An obvious candidate for extension of the
analysis programs of this thesis is to find a way to investigate dynamic statistics of
Mesa programs.

The first thing that one would need for studying dynamic statistics is a facility for
producing them. Since the language was under development at the same time as the
static analysis, it was not considered a good idea to add the additional complication of
making the necessary modifications to allow statement counts. Presumably such
facilities will be provided in the future, but they were not ready in time for this thesis.
Most of the static analysis procedures could be extended to allow frequency weights on
the trees, but some care would have to be taken in defining "previous" in the dynamic
case. One would probably have to use some global flow analysis, restrict dependencies to
cases where the flow is well understood, or establish a known context at all labels.

Another use for the methodology is the ability to simulate and evaluate new 
features. For example, with a slight extension to the matching routines, one can ask 
such questions as "How many times is a variable the same as the previously mentioned 
variable?" Such questions are probably more interesting in the dynamic case, but are 
still of some merit in static counts. 

Another extension mentioned briefly in the preceding section is the problem of
limiting the size and complexity of the interpreter. One interesting, but difficult, 
approach would be to specify a host machine for the interpreter, say one of the 
commercially available microprogrammable computers, with a fixed size of control store 
and try to produce an "optimal" encoding for the language subject to the constraint that 
the interpreter fit into the chosen machine. A simpler problem would be to limit the 
number of allowed patterns, thereby limiting the possible interpretations of code bits 
and hence the size of the interpreter. 






As was mentioned in Chapter 5, very little published work is available on error 
bounds for estimating entropy of a Markov source from experimental data. For that 
matter, there are only a few papers on error analysis for zero-memory sources 
[Basharin59] [Pfaffelhuber71] [Nemetz72]. Hopefully, as information theory is 
applied more often to the analysis of programs and language constructs, there will be a 
motivation within the statistics and information theory community to provide these very 
difficult theoretical analyses. 

Another building block needed for the "automatic generation of program codes" is 
a means for deciding what patterns to try for refinement of existing patterns. Such a 
program falls into the domain of artificial intelligence; a set of heuristics for pattern 
"growing" could perhaps be obtained by looking at the performance of several 
programmers doing the task by hand. 

In conclusion, it should be noted that even without facilities for automating 
instruction set design, much useful guidance can be obtained by the ability to obtain 
program statistics and by the metric that information theory provides for gauging the 
relative importance of various constructs. 

Parse Trees



Node name      Sons   Meaning

abs            1      ABS operator
addr           1      address of variable
and            2      AND operator
apply          2      used by early passes for id [ explist ]
arraydesc      1      descriptor for ARRAY
assign         2      assignment statement
assignx        2      assignment expression
base           1      base of ARRAY
body           1      list of statements of program or procedure body
call           3      procedure call (proc, args, catch phrases)
caseexp        3      case expression (cv, list of cases, endcase)
casestmt       3      case statement (cv, list of cases, endcase)
caseswitch     3      case statement that can be implemented by a dispatch
casetest       2      case statement items with constant labels
catchmark      1      used to mark label for RETRY, etc.
catchphrase    2      SIGNAL catching statement list
construct      2      record constructor statement
constructx     2      record constructor expression
continue       0      CONTINUE statement
dindex         2      index using ARRAY DESCRIPTOR
div            2      / operator
dollar         2      field select within son1 variable
dostmt         5      do statement
dot            2      field select with son1 variable pointer
dst            1      dump state (low level machine dependent)
downthru       2      down through range (cv, range)
enable         2      ENABLE statement (catchphrase, statementlist)
entry          0      frame = procedure entry point
error          3      ERROR statement
exit           0      EXIT statement
extract        2      extraction from record
fdollar        2      field extraction from procedure return record
fextract       2      extraction for procedure return record
field          0      frame = field of a record
forseq         3      sequence type FOR statement iteration
global         0      frame = global to procedure bodies
goto           1      GO TO statement
ifexp          3      if expression (boolean, then, else)
ifstmt         3      if statement (boolean, then, else)
in             2      IN relational
index          2      element of an ARRAY
inlinecall     3      call of INLINE code
intCC          2      interval closed on both ends
intCO          2      interval closed on left, open on right
intOC          2      interval open on left, closed on right
intOO          2      interval open on both ends
item           2      general pairing node with many uses
label          2      block with exit labels (stmt list, exit list)
length         1      LENGTH of ARRAY (from DESCRIPTOR)
list           ?      only arbitrary length node in trees
local          0      frame = local to a procedure body
lst            1      load state (low level machine dependent)
lstf           1      load state and free (low level machine dependent)
loophole       2      treat variable as having another type
max            1      MAX operator (of list)
memory         1      contents of named memory address
min            1      MIN operator (of list)
minus          2      - operator
mod            2      MOD operator
mwconst        1      multiword constant (list of values)
new            3      NEW statement
not            1      NOT operator
nullstmt       0      null statement
num            1      numerical literal
openstmt       2      OPEN statement
or             2      OR operator
plus           2      + operator
register       1      contents of named REGISTER
relE           2      = operator
relG           2      > operator
relGE          2      >= operator
relL           2      < operator
relLE          2      <= operator
relN           2      # operator (not equal)
resume         1      RESUME from SIGNAL handler
retry          0      RETRY statement
return         1      RETURN statement
row            1      list of value for ARRAY construction
rowcons        2      ARRAY constructor statement
rowconsx       2      ARRAY constructor expression
seqindex       2      character selection from STRING
signal         3      SIGNAL statement
start          3      START statement
stop           1      STOP statement
str            1      STRING literal
svc            1      call on operating system function
syserror       1      ERROR statement for unspecified ERROR
temp           0      associated with constructors
times          2      * operator
uminus         1      unary - operator
unionx         2      associated with constructor for variant records
uparrow        1      indirect address operator
upthru         2      FOR sequence up through range (cv, range)
var            4      variable (frame, address, bit address, length)
vconstruct     2      constructor statement for variant records
vconstructx    2      constructor expression for variant records

B.1. General Comments

While the low level details of the implementation are not of general interest, there 
are a few principles and techniques which would be useful to someone implementing a 
similar system. 

The analysis programs and their associated data structures underwent several 
evolutionary changes. There were two general forces that motivated these changes: 
increasing amounts of data, and a desire to facilitate incremental changes to the entropy 
estimate. 

On the subject of increasing amounts of data, one should consider the following
advice: if you have a lot of data to process, you should consider using data processing
techniques. One of the more useful pieces of software used in the analysis was a general
sort/merge package. It was used in three or four different contexts, simply by supplying
it with different input, output, and comparison procedures in each case. This removed
the necessity of keeping all information in the computer main memory at once. In early
versions of the analysis routines, the trees were used exactly as obtained from the
compiler and converted to the analysis form "on the fly"; for the final version, the trees
were converted in a batch mode and all the trees stored on a large data file.

Another useful tool is a hash table package with a very easy-to-use interface. It 
allows the storage and retrieval of several types of items (1 and 2 word keys, strings, etc.) 
and allows counts to be associated with items. The set of routines to find and update 
patterns contains four separate instances of the hash table package. 

B.2. Machine Considerations

The analysis programs are written in an earlier version of Mesa that runs under 
the Tenex operating system [Bobrow72]. The machine used has a 36 bit word with 18 
bits of address space. A most useful facility provided by Tenex is the mapping of disk 
file pages (512 words) into the user's address space. This greatly simplifies virtual 
memory management since "dirty" pages are automatically written back to the disk. The 
large word size allowed auxiliary information to be added to the trees "for free" when a 
potential was seen for improving the system performance by using this information. 
The storage management facilities of the Mesa runtime system make it easy to obtain 
blocks of free storage for making linked data structures, or for providing space for hash 
tables. 

B.3. Token Representation 

The basic item under investigation is the tree node. In order to facilitate
manipulation of nodenames, a uniform representation, called a tokenhandle, is used. A
tokenhandle occupies 18 bits, or a half-word of memory. It has three fields: a
tokentype, a datavalue, and a dataflag. In the final version of the analysis routines, the
tokentype has one of the following four values:

nodename    a nonterminal produced by the compiler
numlit      a number
strlit      a string
newnode     an invented nodename, like var, or global.

Earlier versions had a larger collection of tokentypes. Interpretation of the datavalue
field depends upon the tokentype and the dataflag.

tokentype    dataflag    datavalue

nodename     -           index into an array of strings
numlit       FALSE       a small positive or negative number
numlit       TRUE        a hash table index
strlit       -           a hash table index
newnode      -           a hash table index

The overflow numbers and the string constants are contained in the same hash 
table, which only requires 2000 words of storage for the entire 400,000 words of sample 
trees. For reasons which don't seem very strong in retrospect, the new nodenames are 
kept in their own (200 word) hash table. 
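The tokenhandle encoding can be illustrated with a short Python sketch. The text gives only the total width (18 bits) and the three field names, so the split into 2 bits of tokentype, 1 bit of dataflag, and 15 bits of datavalue is an assumption, as are the helper names pack, unpack, encode_number, and decode_number. A plain list stands in for the hash table that holds overflow numbers.

```python
from enum import IntEnum

class TokenType(IntEnum):
    NODENAME = 0
    NUMLIT = 1
    STRLIT = 2
    NEWNODE = 3

# Assumed layout (not stated in the text): 2 + 1 + 15 = 18 bits.
TYPE_BITS, FLAG_BITS, VALUE_BITS = 2, 1, 15
VALUE_MASK = (1 << VALUE_BITS) - 1
HALF = 1 << (VALUE_BITS - 1)

def pack(tokentype, dataflag, datavalue):
    """Build an 18-bit tokenhandle from its three fields."""
    if not 0 <= datavalue <= VALUE_MASK:
        raise ValueError("datavalue does not fit in the field")
    return ((int(tokentype) << (FLAG_BITS + VALUE_BITS))
            | (int(bool(dataflag)) << VALUE_BITS)
            | datavalue)

def unpack(handle):
    """Split a tokenhandle into (tokentype, dataflag, datavalue)."""
    return (TokenType(handle >> (FLAG_BITS + VALUE_BITS)),
            bool((handle >> VALUE_BITS) & 1),
            handle & VALUE_MASK)

def encode_number(n, overflow):
    """Small numbers are stored inline; others become an overflow index."""
    if -HALF <= n < HALF:
        return pack(TokenType.NUMLIT, False, n & VALUE_MASK)
    if n not in overflow:
        overflow.append(n)
    return pack(TokenType.NUMLIT, True, overflow.index(n))

def decode_number(handle, overflow):
    """Recover the number represented by a numlit tokenhandle."""
    tokentype, dataflag, datavalue = unpack(handle)
    if tokentype is not TokenType.NUMLIT:
        raise ValueError("not a numlit")
    if dataflag:
        return overflow[datavalue]
    # Inline values are kept in two's complement within the datavalue field.
    return datavalue - (1 << VALUE_BITS) if datavalue >= HALF else datavalue
```

Under these assumptions a literal such as -1 is stored inline, while 65535 is replaced by an index into the overflow table.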

B.4. Tree Representation 

Experience with early versions of the analysis programs shows the inadvisability of 
having the sample trees share the same address space as the programs. In fact, there are 
so many trees in the sample that they require more than 18 bits to address when stored 
in a straightforward way. Nevertheless, it is helpful for the matching routines to have 
random access to the set of trees. Fortunately, Tenex allows easy maintenance of 
virtual memories. The data structures used to store the trees are defined using 21 bit 
pointers, more than enough for the amount of data available for analysis. 

Since the total space taken up by the trees is not a critical problem, a very 
conventional representation is used for them. Each tree has a one word header that 
contains a tokenhandle called the nodename, and a soncount. It then contains as many 
more words as it has sons, each containing a sonvalue. A sonvalue has three fields: 

termson   a boolean variable that says whether this son is terminal or 
          nonterminal. 
sonptr    if a terminal son, contains a tokenhandle, otherwise contains a 21 
          bit pointer to the root of the subtree. 
patseq    contains the sequence number of the pattern matched by the node 
          in this son position of this particular tree. 

The uses of the patseq field are discussed in the following section. 
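As an illustration only, the header-plus-sonvalues layout can be mirrored in Python. The class names SonValue and Tree and the helper words_used are hypothetical, strings stand in for tokenhandles, and object references stand in for the 21 bit pointers.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class SonValue:
    termson: bool                  # True if this son is a terminal
    sonptr: Union[str, "Tree"]     # a token if terminal, else the subtree
    patseq: int = 0                # pattern matched by the node in this position

@dataclass
class Tree:
    nodename: str                  # stands in for the tokenhandle in the header
    sons: List[SonValue] = field(default_factory=list)

    @property
    def soncount(self):
        return len(self.sons)

def words_used(tree):
    """One header word plus one word per son, summed over all subtrees."""
    total = 1 + tree.soncount
    for son in tree.sons:
        if not son.termson:
            total += words_used(son.sonptr)
    return total

# A small made-up tree: an assignment of a numeric literal to a variable.
example = Tree("assign", [
    SonValue(False, Tree("var", [SonValue(True, "local"), SonValue(True, "1")])),
    SonValue(False, Tree("num", [SonValue(True, "5")])),
])
```

Here words_used(example) counts three words for assign, three for var, and two for num.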
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B.5. Tree Matching and Pattern Refinement 

To decide what larger patterns should be tried as refinements, and to actually 
accomplish the refinement, a facility for matching trees is required. The routines used 
for this thesis underwent several complete replacements, but none of the algorithms 
represent any advances in the art of tree matching. The first routines were very 
straightforward in design; they could be called "brute force" matching procedures. 
While slow and lacking in esthetics, the procedures worked well enough on the small 
sample set available at the time and provided insight for the design of the other analysis 
routines. When the size of the sample made the matching routines impractical, the 
patseq field was added to the sonvalue record to speed things up as discussed below. 

For this thesis, we are interested in a specialized form of tree matching: we want to 
take an existing pattern and see how the distribution of nodes matching that pattern 
changes when the pattern is made larger. This takes two related forms: 

1. For hypothesizing new patterns, we want to obtain statistics for all possible node 
values occupying a position defined by a particular structural relationship to the 
existing pattern. 

2. For refining an existing pattern, we want to find each node matching the old 
pattern that also matches the new one, note which node occurs in the match, and 
update the data base containing statistics on the node distribution for the various 
patterns. 

The patseq field in the trees makes the above tasks simple. A traversal procedure 
"walks" through the forest of program trees, maintaining a list of ancestors and 
brothers. For each tree to be visited, the patseq field of the sonvalue pointing to that 
tree is used as an index into an array. If we are not interested in expanding that pattern, 
the entry is null, otherwise it points to a list of items, each specifying a structurally 
related node such as father, brother-3-back, etc., and one of the following actions for 
matching the tokenhandle which is the nodename in that node: 

simplename    match a tokenhandle included in the item. 
listofnames   match one of a list of tokenhandles pointed to by a pointer 
              in the item. 
freename      match any tokenhandle, storing the value in a specified 
              position in an array of free values. 
boundname     match the tokenhandle stored in a specified position in the 
              array of free values. 
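The four matching actions can be sketched as follows. This is a minimal illustration, not the routines used for the thesis: the names match_item and match_all, the use of strings for tokenhandles, and the dictionary that maps a structural relation to the nodename found there are all assumptions.

```python
def match_item(action, token, arg, free_values):
    """Apply one matching action to the nodename `token`."""
    if action == "simplename":       # arg is the single name to match
        return token == arg
    if action == "listofnames":      # arg is a collection of acceptable names
        return token in arg
    if action == "freename":         # arg is a slot; remember whatever is seen
        free_values[arg] = token
        return True
    if action == "boundname":        # arg is a slot; must equal the stored name
        return free_values[arg] == token
    raise ValueError("unknown action: " + action)

def match_all(items, context, free_slots=4):
    """items: list of (relation, action, arg); context: relation -> nodename."""
    free_values = [None] * free_slots
    for relation, action, arg in items:
        token = context.get(relation)
        if token is None or not match_item(action, token, arg, free_values):
            return False
    return True

# Father is plus or minus, and the two preceding brothers carry the same name.
items = [
    ("father", "listofnames", {"plus", "minus"}),
    ("brother-1-back", "freename", 0),
    ("brother-2-back", "boundname", 0),
]
```

A freename item followed by a boundname item on the same slot expresses "these two positions hold the same nodename" without naming it.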

During the refinement procedure, whenever an instance of the new pattern is 
encountered as described in Section 4.4, a transaction (old pattern, new pattern, 
nodename, count) is added to a hash table (or the count of an existing one incremented), 
indicating that the node now is in the set of nodes matching the new pattern. At the 
same time, the patseq field of the sonvalue pointing to the node is updated to reflect the 
new pattern. Figure B.1 shows the same tree as Figure 4.5, but with the pattern sequence 
numbers corresponding to the final pattern set. 

[Tree diagram garbled in extraction. The recoverable upper levels are: root ifstmt [245] 
with sons relN [21], assign [22], and <empty> [23]; relN has sons dot [36] and num [37]; 
assign has sons var [11] and relE [12]. The lower levels are not recoverable.] 

Figure B.1. Program tree showing pattern numbers from final set. 

While the patseq scheme works satisfactorily, it is probably not the best. If disk 
space is not a concern, it would speed things up considerably to keep a completely 
inverted file with a list of tree addresses for the nodes matching each pattern, and to make 
the pointers in the trees two-way pointers so that father nodes can be found without 
having to traverse the tree from the top. 
B.6. Average Code Length Calculations 

Recall from Section 4.5, dealing with program transformations, that we sometimes 
find it necessary to delete all of the nodes in a subtree from the data base of matched 
nodes. This is accomplished by a procedure that produces a set of transactions changing 
the matched pattern of each node to 0. If one applies the procedure to a collection of 
entire program trees and writes these transactions onto a file, the data can be given a 
different interpretation. The file contains a concise record of pattern-nodename pair 
counts for the portion of the sample that was used to produce the file. This is precisely 
the data needed for the entropy formula derived in Section 5.5. If we further sort the 
data so that all nodes matching a given pattern occur together, it becomes trivial to write 
a program estimating the entropy for this set of trees. 
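A program of this kind might look like the sketch below. The formula of Section 5.5 is not reproduced in this appendix; the sketch assumes the estimate is the sum over patterns of p times H, where H is the entropy of the nodename distribution under a pattern and p is the fraction of nodes matching it, which is consistent with the "H contr." column of Appendix C. The function name and record shape are hypothetical. Note that p is computed relative to the records supplied, so it only equals the tabulated p when the whole sample is given.

```python
import math
from collections import defaultdict

def entropy_estimate(pairs):
    """pairs: iterable of (pattern, nodename, count) records.

    Returns (estimate, per_pattern) where per_pattern maps each pattern
    to (f, H, p, p * H).
    """
    by_pattern = defaultdict(list)
    for pattern, _nodename, count in pairs:
        by_pattern[pattern].append(count)
    total = sum(sum(counts) for counts in by_pattern.values())
    per_pattern = {}
    estimate = 0.0
    for pattern, counts in by_pattern.items():
        f = sum(counts)
        # Entropy in bits of the nodename distribution under this pattern.
        h = -sum(c / f * math.log2(c / f) for c in counts if c)
        p = f / total
        per_pattern[pattern] = (f, h, p, p * h)
        estimate += p * h
    return estimate, per_pattern

# Counts for patterns (1) var[@] and (14) dot[*,@] from Appendix C.
records = [
    (1, "local", 17580), (1, "field", 7744),
    (1, "global", 12279), (1, "entry", 2686),
    (14, "var", 5726),
]
estimate, per_pattern = entropy_estimate(records)
```

For pattern (1) the computed H agrees with the tabulated value of 1.762, and pattern (14), which admits only one nodename, contributes nothing.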

Of more importance is the average code length computation of Section 5.8. The 
averages are produced as follows: 

• The number of sub-samples, n, is decided upon (in this case 5). 

• Data files are produced by the elimination procedure for n collections of program 
trees of approximately the same magnitude, whose union is the entire sample 
set. A simple program transforms these data into records that contain: 

pattern sequence, nodename, sub-sample number, count. 

• The data for all sub-samples are sorted according to the first three keys named 
above. 

• Another simple program produces a file with n+3 word records, containing the 
following: 

pattern sequence, nodename, sample total, (n) sub-sample totals. 

• It is now simple to calculate the average code cost of one sub-sample in terms of 
the others by using the sub-sample totals (some of which may be zero) and 
sample totals for the various patterns and nodes (the actual names of the nodes 
are irrelevant, though they are contained in the file). 
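The last step can be sketched as below. This is an illustration under assumptions, not the computation of Section 5.8: probabilities for sub-sample s are estimated from the counts in the other sub-samples, each node is charged the negative base-2 logarithm of its estimated probability, and nodes whose (pattern, nodename) pair never occurs in the other sub-samples are simply tallied as uncoded, since the treatment of that case is not reproduced here. The function name and record shape are hypothetical.

```python
import math
from collections import defaultdict

def held_out_cost(records, s):
    """records: {(pattern, nodename): [count in each sub-sample]}.

    Returns (average bits per coded node of sub-sample s, uncoded nodes).
    """
    # Per-pattern totals over all sub-samples except s.
    pattern_others = defaultdict(int)
    for (pattern, _nodename), counts in records.items():
        pattern_others[pattern] += sum(counts) - counts[s]

    bits, coded, uncoded = 0.0, 0, 0
    for (pattern, _nodename), counts in records.items():
        mine = counts[s]
        others = sum(counts) - mine      # sample total minus sub-sample total
        if mine == 0:
            continue
        if others == 0:                  # never seen outside sub-sample s
            uncoded += mine
            continue
        bits += mine * -math.log2(others / pattern_others[pattern])
        coded += mine
    return (bits / coded if coded else 0.0), uncoded

# Two sub-samples, one pattern, three nodenames (made-up counts).
demo = {(1, "a"): [2, 6], (1, "b"): [2, 2], (1, "c"): [1, 0]}
average, uncoded = held_out_cost(demo, 0)
```

As the text notes, the nodenames themselves play no part in the arithmetic; only the grouping by pattern matters.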
These procedures were developed after most of the data analysis was performed for 
the thesis. If one were designing a new system to analyze a language, it would be 
worthwhile to consider how much of the computation can be done by such batch 
techniques. 

B.7. Tree Printing 

There are several figures in this thesis that contain diagrams of trees; these were 
primarily produced by a tree printing program. While there are some shortcomings to 
the program, there seems to be no published algorithm that does better. See, for 
example, Knuth's program for printing small binary trees [Knuth71b]. The program is 
based on a few easily stated principles: 

• All nodes at a given level in the tree are at the same level on the page. 

• Each nonterminal node is centered over the names of its sons. 

• The width of the resulting tree is minimized. 

There could be many implementations of such an algorithm. We will loosely 
discuss the program used for the thesis. The program relies heavily on the fact that all 
characters and blanks on the printing device are the same width, a restriction that is met 
by most currently available computer output devices. The program was also designed 
under the assumption of the availability of large amounts of data space in the computer 
memory. When we discuss the program, it will be useful to look at some sample output. 
Figure B.2 is a portion of the tree from Figure 3.2. 
Figure B.2. A portion of a printed tree. 

The program represents the entire tree image in memory before printing any of it; 
the unit of representation is a record called a pageentry. This record contains one 
character on each of three lines, called the bracket, the arrow, and the name. The 
principal data structure is an array called row. Each entry of row has two fields, an 
integer named nextavailable, and an array of pageentry's named column. The 
interpretation of these variables is as follows: 

• row[i].nextavailable is the next free character position of the i-th row of the 
printed tree. 

• row[i].column is an array of pageentry's, each containing a single character on 
each of 3 adjacent lines on the output. row[i].column[j] is the pageentry that 
describes the j-th character position of those lines. 

• row[i].column[j].bracket is a blank or a character in the line below a nodename 
(| or _). 

• row[i].column[j].arrow is a blank or a character pointing down to a nodename 
(/, |, or \). 

• row[i].column[j].name is a blank or a character in a nodename. 

Since trees come in assorted sizes and shapes, it is difficult to see what size to make 
the various arrays. In the program, the arrays are allocated at run time. The length of 
the array row is made equal to the maximum depth of the tree. The length of the array 
row[i].column is made equal to the rightmost character position occupied by a nonblank 
character in any of the three lines of the i-th row. The determination of the lengths for 
the column arrays is made by running the placement procedure described below without 
actually storing into the arrays. 

In our discussion below, if t is a tree, let t_1, t_2, ..., t_{k_t} be the sons of t. Let 
t^nodename denote the nodename of the root of t, and let |t^nodename| denote its length. In 
order to simplify the program, we will have each nodename contain a trailing blank, 
which is also reflected in its length. The constant blankentry is a pageentry with all 
three characters equal to blank. 

The Placement Procedure 

Most of the work of the program is done by a recursive procedure called 
placement, which is called with a pointer to a tree and a few other parameters and 
returns the character position of the nodename of its root. 

Calling Sequence: placement[t, leftmost, rownum] 
Input Parameters: 

t — a tree to be placed on the "page." 

leftmost — the leftmost character position allowed for its nodename. 

rownum — the number of the row on which to place the name. 
Returned Value: pos — position actually given to the nodename. 
Side Effects: builds a representation of the tree in the array row. 

Description of the procedure: 

1. (Make first approximation to placement.) 
   Set l = MAX{leftmost, row[rownum].nextavailable}. 

2. (Terminal node?) If t has no sons (k_t = 0), set pos = l, and go to step 12. 

3. (Calculate widths.) Set r = l + |t^nodename|, 
   set totalsons = SUM_{i=1}^{k_t} |t_i^nodename|. 

4. (Place first son.) Set l = placement[t_1, (l+r)/2 - (totalsons-1)/2, rownum+1], 
   set al = l + |t_1^nodename|/2. 

5. (Only one son?) If (k_t = 1), 
   set row[rownum+1].column[al].arrow = "|", 
   set r = l + |t_1^nodename|, and go to step 10. 

6. (Point down to first son.) Set row[rownum+1].column[al].arrow = "/". 

7. (Place middle sons.) For 1 < i < k_t, 
   set a = placement[t_i, 0, rownum+1] + |t_i^nodename|/2, 
   set row[rownum+1].column[a].arrow = "|". 

8. (Place final son.) Set r = placement[t_{k_t}, 0, rownum+1] + |t_{k_t}^nodename|, 
   set ar = r - (|t_{k_t}^nodename|+1)/2, 
   set row[rownum+1].column[ar].arrow = "\". 

9. (Draw bar between outside sons.) 
   For al < i < ar, set row[rownum+1].column[i].bracket = "_". 

10. (Determine position for root.) 
    Set pos = MAX{row[rownum].nextavailable, (l+r)/2 - |t^nodename|/2}. 

11. (Point down from root.) Set ac = pos + |t^nodename|/2, 
    set row[rownum+1].column[ac].bracket = "|". 

12. (Write root nodename.) 
    For row[rownum].nextavailable <= i < pos, 
       set row[rownum].column[i] = blankentry. 
    For 0 <= i < |t^nodename|, 
       set row[rownum].column[pos+i].name = t^nodename[i], 
       set row[rownum].column[pos+i].bracket = blank, 
       set row[rownum].column[pos+i].arrow = blank. 
    Set row[rownum].nextavailable = pos + |t^nodename|. 

13. Return pos, the placement of t. 
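The placement procedure can be tried out with the following Python sketch, which follows the thirteen steps but departs from the thesis program in labeled ways. It assumes that the divisions are integer divisions, and it keeps each of the three lines of a row as a sparse dictionary (position to character) rather than a preallocated column array, so absent entries play the role of blanks and no dry run is needed to size the arrays. The class names Tree and Row and the helper render are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Tree:
    name: str
    sons: list = field(default_factory=list)

    @property
    def width(self):
        return len(self.name) + 1      # nodename plus its trailing blank

class Row:
    def __init__(self):
        self.nextavailable = 0
        # Sparse stand-ins for the column array: position -> character.
        self.bracket, self.arrow, self.name = {}, {}, {}

def placement(t, leftmost, rownum, rows):
    while len(rows) <= rownum + 1:     # grow the row array on demand
        rows.append(Row())
    row, below = rows[rownum], rows[rownum + 1]
    l = max(leftmost, row.nextavailable)                          # step 1
    if not t.sons:                                                # step 2
        pos = l
    else:
        r = l + t.width                                           # step 3
        totalsons = sum(s.width for s in t.sons)
        first, last = t.sons[0], t.sons[-1]
        l = placement(first, (l + r) // 2 - (totalsons - 1) // 2,
                      rownum + 1, rows)                           # step 4
        al = l + first.width // 2
        if len(t.sons) == 1:                                      # step 5
            below.arrow[al] = "|"
            r = l + first.width
        else:
            below.arrow[al] = "/"                                 # step 6
            for s in t.sons[1:-1]:                                # step 7
                a = placement(s, 0, rownum + 1, rows) + s.width // 2
                below.arrow[a] = "|"
            r = placement(last, 0, rownum + 1, rows) + last.width  # step 8
            ar = r - (last.width + 1) // 2
            below.arrow[ar] = "\\"
            for i in range(al + 1, ar):                           # step 9
                below.bracket[i] = "_"
        pos = max(row.nextavailable, (l + r) // 2 - t.width // 2)  # step 10
        below.bracket[pos + t.width // 2] = "|"                   # step 11
    for i, ch in enumerate(t.name):                               # step 12
        row.name[pos + i] = ch
    row.nextavailable = pos + t.width
    return pos                                                    # step 13

def render(rows):
    """Emit the bracket, arrow, and name lines of each row, skipping empty ones."""
    out = []
    for row in rows:
        for line in (row.bracket, row.arrow, row.name):
            if line:
                out.append("".join(line.get(i, " ") for i in range(max(line) + 1)))
    return "\n".join(out)

rows = []
root_pos = placement(Tree("A", [Tree("B"), Tree("C")]), 0, 0, rows)
```

For this two-son tree, render(rows) gives a name line for A, then the bracket, arrow, and name lines of the row holding B and C.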

The Printing Program 

The placement procedure builds a representation of the tree in the array row. Once 
this is done, it is straightforward to print the tree. This is done by procedures with 
knowledge of how many rows will fit on a page and the width of a page. The resulting 
output is a collection of pages which can be cut and pasted into a large tree diagram. 

In the implementation of placement, any statements that actually write into the 
column arrays first test a boolean variable. Hence, placement can first be run with 
writing inhibited, resulting in values for the nextavailable fields, which are then used 
to allocate column arrays of the proper 
size. Several fields were added to the tree nodes to facilitate printing. A width field 
saves computation when determining the width of the nodename, and allows the trailing 
blank to be synthesized by having the width be one more than the actual string width. A 
value field allows a number to be associated with a node and to be printed in brackets 
above the nodename. This is done by adding a value character field to the pageentry 
record. When the value is printed, the width field contains the maximum space needed 
for the nodename or value. 

The final printing program is described below: 

• Compute maximum depth, setting width fields. 

• Allocate row array, with empty column elements. 

• Call placement[root, 0, 0], with printing inhibited. 

• Allocate column elements of proper size, resetting nextavailable fields. 

• Call placement[root, 0, 0], with printing enabled. 

• Produce printed output. 

Experience with the Tree Printer 

Another common way of printing trees is the "table of contents" listing. In this 
form, the structure is shown by indentation. All nodes at a given level in the tree are 
given the same indentation in the listing. Such listings often are more compact than 
tree diagrams, and show local structure adequately. They do not, however, show the 
global structure of the tree nearly as well as a diagram does. 
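For comparison, a "table of contents" listing takes only a few lines. The sketch below is illustrative; the function name listing and the representation of a tree as a (name, sons) pair are assumptions.

```python
def listing(tree, depth=0, indent=2):
    """Return the lines of an indented listing of a (name, sons) tree."""
    name, sons = tree
    lines = [" " * (indent * depth) + name]
    for son in sons:
        lines.extend(listing(son, depth + 1, indent))
    return lines

example = ("ifstmt", [
    ("relN", [("dot", []), ("num", [])]),
    ("assign", []),
    ("<empty>", []),
])
text = "\n".join(listing(example))
```

Every node costs exactly one output line, which is why such listings are compact, but brothers separated by a large subtree end up far apart on the page.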

The tree printing program proved very helpful when designing and debugging the 
analysis routines. There are several data structures that are treelike: program trees, 
patterns, etc. Procedures were written to convert each of them into trees acceptable to 
the printer. For program trees, which tend to be large, the procedure provides an option 
of truncating the tree at a specified depth. One can set the value of the node to its 
address, its pattern sequence number, its "cost", or other useful information. Below we 
shall see some of the problems with the program. 



[Tree diagram garbled in extraction. The recoverable labels are B, H, I, J, K, and 
"(A very wide subtree)".] 

Figure B.3. Example tree showing a shortcoming of the printing algorithm. 

The principal shortcoming of the algorithm is somewhat difficult to demonstrate 
in a small tree. Figure B.3 is a somewhat contrived example that points out the 
problem. Consider the sons of node D: they are E, F, and G. Since the subtree E is 
"shallow", it is placed quite far to the left. The nodes F and G must be placed 
considerably farther to the right in order to make room for their subtrees. With larger, 
wider trees, a shallow subtree such as E can be so far from its brothers that it gets 
"lost". Shallow subtrees in son positions other than the first lead to uneven spacing of 
the sons. One could probably add a pass to the algorithm that, once having established 
the rightmost subtree of a node, then reformats the other subtrees for compactness and 
even spacing. 

The array structure used to represent the tree in memory was chosen in order to 
make the final, printing pass of the program very easy, and since it was written for a 
large computer with virtual memory, was a reasonable choice. For a smaller machine, 
some sort of list structure would probably be better, but would require a more 
sophisticated algorithm for slicing the diagram up into page-sized pieces for printing. 
This appendix contains the detailed statistics from the sample set. Both the initial 
and final pattern sets are shown, together with their contribution to the total entropy 
estimate. The patterns are shown sorted by various keys, but the detailed listing of 
nodes associated with a pattern is given only for the listing sorted by pattern number. 

C.1. Initial Pattern Set — Sorted by Pattern Number 

In order to conserve space, the full set of nodes matching a given pattern is not 
always given. Any node whose conditional probability exceeds .005 is shown, however. 



(1) var[@] 
f = 40289 H = 1.762, p = .136, H contr. = .239 
4 cases 
local        17580  .44    field         7744  .19 
global       12279  .30    entry         2686  .07 

(2) var[*,@] 
f = 40289 H = 3.680, p = .136, H contr. = .499 
154 cases (18 shown) 
0            12914  .32    6              946  .02    13             316  .01 
1             6658  .17    9              833  .02    16             247  .01 
2             4014  .10    7              825  .02    15             221  .01 
3             3131  .08    10             590  .01    17             209  .01 
4             2076  .05    12             469  .01    <<others>>    2943  .07 
8             1532  .04    11             464  .01 
5             1517  .04    14             384  .01 

(3) var[*,*,@] 
f = 40289 H = .336, p = .136, H contr. = .046 
16 cases 
0            38841  .96    4               71  .00    7               20  .00 
2              432  .01    6               63  .00    11              18  .00 
3              283  .01    12              56  .00    10              13  .00 
1              175  .00    5               54  .00    13               8  .00 
14             104  .00    15              33  .00 
8               90  .00    9               28  .00 

(4) var[*,*,*,@] 
f = 40289 H = 2.099, p = .136, H contr. = .285 
67 cases (13 shown) 
16           26020  .65    8              451  .01    3              334  .01 
14            6502  .16    2              414  .01    48             215  .01 
1             1509  .04    64             381  .01    0              211  .01 
32            1168  .03    11             357  .01    <<others>>    1769  .04 
15             611  .02    4              347  .01 

(5) list[*...,@] 
f = 22481 H = 3.763, p = .076, H contr. = .285 
72 cases (20 shown) 
assign        5085  .23    dot            582  .03    addr           207  .01 
call          3573  .16    casetest       572  .03    <empty>        188  .01 
num           2623  .12    casestmt       509  .02    fextract       185  .01 
var           2420  .11    dostmt         448  .02    unionx         179  .01 
ifstmt        1528  .07    str            269  .01    construct      165  .01 
return        1409  .06    relE           233  .01    plus           127  .01 
item          1068  .05    dollar         219  .01    <<others>>     892  .04 


(6) num[@] 
f = 10220 H = 5.132, p = .034, H contr. = .177 
408 cases (23 shown) 
0             2254  .22    16             196  .02    9               73  .01 
1             1786  .17    13             161  .02    32767           73  .01 
2              748  .07    5              143  .01    15              59  .01 
3              537  .05    7              138  .01    12              58  .01 
16383          397  .04    6              113  .01    14              54  .01 
-1             345  .03    8               92  .01    255             54  .01 
65535          242  .02    10              92  .01    11              53  .01 
4              221  .02    32              82  .01    <<others>>    2249  .22 

(7) list[@] 
f = 7425 H = 2.476, p = .025, H contr. = .062 
38 cases (11 shown) 
2             3599  .48    5              322  .04    10              61  .01 
3             1139  .15    6              208  .03    0               60  .01 
1             1038  .14    7              113  .02    9               54  .01 
4              591  .08    8               80  .01    <<others>>     160  .02 

(8) call[@] 
f = 6616 H = .238, p = .022, H contr. = .005 
var           6370  .96    dot            236  .04    dollar          10  .00 

(9) call[*,@] 
f = 6616 H = 2.961, p = .022, H contr. = .066 
27 cases (12 shown) 
list          1816  .27    dollar         209  .03    dindex          56  .01 
var           1739  .26    str            204  .03    ifexp           47  .01 
<empty>        760  .11    call           189  .03    <<others>>     142  .02 
num            686  .10    addr           129  .02 
dot            571  .09    plus            68  .01 

(10) call[*,*,@] 
f = 6616 H = .112, p = .022, H contr. = .002 
<empty>       6517  .99    catchphrase     99  .01 

(11) assign[@] 
f = 5826 H = 1.413, p = .020, H contr. = .028 
9 cases 
var           4011  .69    uparrow        102  .02    seqindex        17  .00 
dot           1090  .19    index           79  .01    register         9  .00 
dollar         450  .08    dindex          63  .01    memory           5  .00 

(12) assign[*,@] 
f = 5826 H = 3.591, p = .020, H contr. = .070 
42 cases (18 shown) 
call          1288  .22    addr           208  .04    uparrow         36  .01 
num           1078  .19    assignx        153  .03    index           36  .01 
var            775  .13    minus          134  .02    mwconst         34  .01 
dot            580  .10    dindex         113  .02    times           31  .01 
plus           377  .06    inlinecall      72  .01    <<others>>     213  .04 
<empty>        314  .05    arraydesc       63  .01 
dollar         261  .04    ifexp           60  .01 

(13) dot[@] 
f = 5726 H = 1.400, p = .019, H contr. = .027 
9 cases 
var           2965  .52    dollar         110  .02    minus            7  .00 
plus          2352  .41    register        16  .00    call             7  .00 
dot            254  .04    num             11  .00    assignx          4  .00 
(14) dot[*,@] 
f = 5726 H = 0.000, p = .019, H contr. = .000 
var           5726 1.00 

(15) plus[@] 
f = 4039 H = 1.037, p = .014, H contr. = .014 
22 cases (11 shown) 
var           3396  .84    times           39  .01    plus            16  .00 
dot            357  .09    minus           25  .01    call            15  .00 
dollar          60  .01    register        22  .01    caseexp          5  .00 
num             49  .01    index           21  .01    <<others>>      34  .01 

(16) plus[*,@] 
f = 4039 H = 1.279, p = .014, H contr. = .017 
17 cases (11 shown) 
var           3085  .76    call            51  .01    caseexp          6  .00 
num            504  .12    times           18  .00    ifexp            5  .00 
dollar         182  .05    div             13  .00    minus            5  .00 
dot            149  .04    inlinecall      10  .00    <<others>>      11  .00 

(17) dollar[@] 
f = 2770 H = 2.051, p = .009, H contr. = .019 
8 cases 
var            968  .35    dollar         109  .04    call            18  .01 
uparrow        952  .34    index           77  .03    assignx          1  .00 
dot            568  .21    dindex          77  .03 

(18) dollar[*,@] 
f = 2770 H = 0.000, p = .009, H contr. = .000 
var           2770 1.00 

(19) relE[@] 
f = 2348 H = 2.010, p = .008, H contr. = .016 
14 cases 
<empty>       1249  .53    call            50  .02    index            7  .00 
var            558  .24    inlinecall      15  .01    plus             5  .00 
dot            230  .10    dindex          12  .01    minus            2  .00 
dollar         136  .06    seqindex        11  .00    uparrow          1  .00 
assignx         64  .03    mod              8  .00 

(20) relE[*,@] 
f = 2348 H = .819, p = .008, H contr. = .006 
15 cases 
num           1976  .84    call             4  .00    plus             1  .00 
var            298  .13    dindex           3  .00    addr             1  .00 
dot             41  .02    mwconst          3  .00    seqindex         1  .00 
dollar          10  .00    relN             2  .00    length           1  .00 
relE             5  .00    or               1  .00    register         1  .00 

(21) ifstmt[@] 
f = 1810 H = 3.047, p = .006, H contr. = .019 
16 cases 
relE           593  .33    or              86  .05    relLE           20  .01 
relN           370  .20    dot             57  .03    in              18  .01 
not            157  .09    call            48  .03    assignx          3  .00 
var            135  .07    relL            36  .02    dindex           3  .00 
and            132  .07    relGE           29  .02 
relG           101  .06    dollar          22  .01 

(22) ifstmt[*,@] 
f = 1810 H = 3.059, p = .006, H contr. = .019 
24 cases (15 shown) 
list           598  .33    signal          70  .04    resume          12  .01 
call           324  .18    ifstmt          40  .02    construct       11  .01 
assign         249  .14    syserror        37  .02    openstmt        10  .01 
return         196  .11    goto            33  .02    <<others>>      32  .02 
error           82  .05    dostmt          23  .01 
exit            75  .04    casestmt        18  .01 
(23) ifstmt[*,*,@] 
f = 1810 H = 1.329, p = .006, H contr. = .008 
18 cases (11 shown) 
<empty>       1406  .78    ifstmt          45  .02    return           6  .00 
list           173  .10    casestmt        13  .01    goto             4  .00 
call            69  .04    openstmt        10  .01    fextract         2  .00 
assign          65  .04    dostmt           7  .00    <<others>>      10  .01 

(24) return[@] 
f = 1683 H = 2.521, p = .006, H contr. = .014 
30 cases (12 shown) 
<empty>        839  .50    caseexp         32  .02    relE            15  .01 
var            337  .20    constructx      31  .02    plus            15  .01 
num            148  .09    dollar          26  .02    <<others>>      47  .03 
list            74  .04    ifexp           24  .01 
call            73  .04    dot             22  .01 

(25) @ 
f = 1513 H = 0.000, p = .005, H contr. = 0.000 
body          1513 1.00 

(26) body[@] 
f = 1513 H = 0.000, p = .005, H contr. = 0.000 
list          1513 1.00 

(27) uparrow[@] 
f = 1270 H = 1.191, p = .004, H contr. = .005 
7 cases 
plus           823  .65    dollar           4  .00    register         1  .00 
var            379  .30    minus            2  .00 
num             60  .05    dot              1  .00 

(28) item[@] 
f = 1214 H = .936, p = .004, H contr. = .004 
9 cases 
relE          1021  .84    in              25  .02    relN   [illegible]  .00 
list            97  .08    relL             7  .01    relLE  [illegible]  .00 
lbl             52  .04    relG             7  .01    relGE  [illegible]  .00 

(29) item[*,@] 
f = 1214 H = 3.315, p = .004, H contr. = .014 
36 cases (19 shown) 
list           425  .35    num             26  .02    resume          11  .01 
assign         187  .15    dollar          20  .02    and             10  .01 
call           170  .14    goto            20  .02    exit            10  .01 
ifstmt         102  .08    openstmt        17  .01    dot              7  .01 
casestmt        42  .03    caseexp         14  .01    continue         7  .01 
nullstmt        41  .03    signal          12  .01    <<others>>      45  .04 
return          37  .03    error           11  .01 

(30) casestmt[@] 
f = 639 H = 1.543, p = .002, H contr. = .003 
11 cases 
dollar         450  .70    num             17  .03    uparrow     [illegible]  .00 
var             67  .10    assignx          4  .01    inlinecall  [illegible]  .00 
dot             55  .09    seqindex         3  .00    index       [illegible]  .00 
call            37  .06    minus            2  .00 

(31) casestmt[*,@] 
f = 639 H = 0.000, p = .002, H contr. = 0.000 
list           639 1.00 

(32) casestmt[*,*,@] 
f = 639 H = 2.581, p = .002, H contr. = .006 
14 cases 
<empty>        269  .42    assign          22  .03    nullstmt         7  .01 
syserror       151  .24    return          22  .03    goto             4  .01 
list            61  .10    ifstmt          13  .02    casestmt         2  .00 
signal          35  .05    exit            10  .02    openstmt         1  .00 
call            32  .05    error           10  .02 







(33) casetest[@] 
f = 572 H = .597, p = .002, H contr. = .001 
num            489  .85    list            83  .15 

(34) casetest[*,@] 
f = 572 H = 2.612, p = .002, H contr. = .005 
15 cases 
list           194  .34    nullstmt        16  .03    openstmt         4  .01 
assign         140  .24    str             15  .03    exit             3  .01 
call           112  .20    return          11  .02    goto             2  .00 
num             41  .07    ifexp            6  .01    signal           2  .00 
ifstmt          20  .03    casestmt         5  .01    label            1  .00 

(35) addr[@] 
f = 570 H = 1.824, p = .002, H contr. = .004 
6 cases 
var            339  .59    uparrow         68  .12    index           32  .06 
dot             84  .15    dollar          32  .06    dindex          15  .03 

(36) relN[@] 
f = 538 H = 2.411, p = .002, H contr. = .004 
16 cases 
var            209  .39    dindex           6  .01    mod              2  .00 
dot            157  .29    seqindex         6  .01    fdollar          1  .00 
dollar          69  .13    plus             4  .01    uparrow          1  .00 
call            37  .07    index            3  .01    length           1  .00 
inlinecall      24  .04    <empty>          2  .00 
assignx         14  .03    minus            2  .00 

(37) relN[*,@] 
f = 538 H = 1.174, p = .002, H contr. = .002 
10 cases 
num            430  .80    call             7  .01    inlinecall       2  .00 
var             53  .10    addr             5  .01    mwconst          1  .00 
dot             24  .04    seqindex         5  .01 
dollar           8  .01    uparrow          3  .01 

(38) str[@] 
f = 511 H = 8.735, p = .002, H contr. = .015 
449 cases (11 shown) 
"Break"          4  .01    "VM."            3  .01    "error"          3  .01 
"Trace"          4  .01    '• XXX"          3  .01    "Error # "       3  .01 
" . xra"         3  .01    "NIL"            3  .01    "New"            2  .00 
« __ »           3  .01    ".XM"            3  .01    <<others>>     477  .93 

(39) dostmt[@] 
f = 484 H = 1.610, p = .002, H contr. = .003 
4 cases 
<empty>        219  .45    forseq          84  .17 
upthru         170  .35    downthru        11  .02 

(40) dostmt[*,@] 
f = 484 H = 1.777, p = .002, H contr. = .003 
12 cases 
<empty>        277  .57    relL            10  .02    in               3  .01 
not            136  .28    and              5  .01    var              3  .01 
relN            28  .06    relG             5  .01    relLE            2  .00 
relE            11  .02    relGE            3  .01    or               1  .00 

(41) dostmt[*,*,@] 
f = 484 H = 2.224, p = .002, H contr. = .004 
13 cases 
list           253  .52    openstmt         8  .02    inlinecall       1  .00 
assign          66  .14    label            7  .01    dostmt           1  .00 
ifstmt          54  .11    signal           3  .01    catchmark        1  .00 
casestmt        44  .09    nullstmt         3  .01 
[illegible]     40  .08    enable           3  .01 



(42) dostmt[*,*,*,@]

f = 484 H = .133, p = .002, H contr. = .000
<empty> 475 .98    item 9 .02








(43) dostmt[*,*,*,*,@]

f = 484 H = .292, p = .002, H contr. = .000
8 cases
<empty> 468 .97    ifstmt 2 .00    goto 1 .00
list 6 .01    return 2 .00    exit 1 .00
assign 3 .01    call 1 .00








(44) minus[@]

f = 469 H = 2.672, p = .002, H contr. = .004
18 cases
var 224 .48    call 11 .02    uminus 3 .01
dot 62 .13    div 10 .02    length 3 .01
<empty> 45 .10    minus 8 .02    times 2 .00
plus 38 .08    index 7 .01    uparrow 2 .00
dollar 25 .05    dindex 4 .01    assignx 1 .00
num 20 .04    ifexp 3 .01    abs 1 .00


(45) minus[*,@]

f = 469 H = 1.590, p = .002, H contr. = .003
14 cases
num 318 .68    minus 4 .01    ifexp 1 .00
var 91 .19    times 4 .01    assignx 1 .00
dot 17 .04    index 4 .01    div 1 .00
dollar 15 .03    addr 2 .00    inlinecall 1 .00
plus 8 .02    call 2 .00








(46) dindex[@]

f = 433 H = .195, p = .001, H contr. = .000
var 420 .97    dot 13 .03








(47) dindex[*,@]

f = 433 H = 2.450, p = .001, H contr. = .004
11 cases
var 131 .30    minus 21 .05    call 3 .01
num 102 .24    dot 8 .02    uparrow 2 .00
times 89 .21    assignx 4 .01    ifexp 1 .00
plus 68 .16    dollar 4 .01








(48) not[@]

f = 382 H = 2.636, p = .001, H contr. = .003
13 cases
relE 118 .31    in 19 .05    and 1 .00
var 97 .25    relL 5 .01    relN 1 .00
call 50 .13    relG 5 .01    relGE 1 .00
dot 47 .12    or 4 .01
dollar 32 .08    ifexp 2 .01








(49) assignx[@]

f = 348 H = 1.308, p = .001, H contr. = .002
7 cases
var 246 .71    uparrow 7 .02    dindex 1 .00
dot 69 .20    seqindex 4 .01
dollar 18 .05    index 3 .01








(50) assignx[*,@]

f = 348 H = 3.349, p = .001, H contr. = .004
21 cases (14 shown)
call 73 .21    assignx 23 .07    register 6 .02
num 71 .20    minus 19 .05    inlinecall 3 .01
dot 40 .11    dollar 19 .05    seqindex 3 .01
plus 34 .10    dindex 12 .03    relE 2 .01
var 30 .09    addr 6 .02    <<others>> 7 .02

(51) index[@]

f = 338 H = .965, p = .001, H contr. = .001
var 265 .78    dollar 42 .12    dot 31 .09


(52) index[*,@]

f = 338 H = 2.325, p = .001, H contr. = .003
11 cases
var 106 .31    dollar 13 .04    inlinecall 2 .01
times 85 .25    mod 4 .01    ifexp 1 .00
num 85 .25    plus 3 .01    div 1 .00
minus 36 .11    dot 2 .01








(53) times[@]

f = 317 H = 1.702, p = .001, H contr. = .002
13 cases
var 227 .72    plus 9 .03    length 2 .01
minus 22 .07    inlinecall 4 .01    ifexp 1 .00
call 18 .06    times 3 .01    abs 1 .00
dot 16 .05    div 2 .01
dollar 10 .03    uminus 2 .01








(54) times[*,@]

f = 317 H = .535, p = .001, H contr. = .001
5 cases
num 292 .92    dot 8 .03    dollar 1 .00
var 10 .03    call 6 .02








(55) inlinecall[@]

f = 262 H = 3.019, p = .001, H contr. = .003
18 cases
BITAND 86 .33    Stop 9 .03    PORT I 2 .01
BITSHIFT 42 .16    BITXOR 7 .03    CONVERT 1 .00
BITOR 39 .15    LDIVMOD 4 .02    PUSH 1 .00
DIVMOD 26 .10    BLOCK 4 .02    use 1 .00
COPY 17 .06    NovaOutLd 3 .01    LongDiv 1 .00
BITNOT 16 .06    NovaInLd 2 .01    LongMult 1 .00


(56) inlinecall[*,@]

f = 262 H = .843, p = .001, H contr. = .001
8 cases
list 230 .88    var 7 .03    uparrow 1 .00
num 8 .03    dot 4 .02    index 1 .00
<empty> 7 .03    inlinecall 4 .02








(57) and[@]

f = 251 H = 3.104, p = .001, H contr. = .003
14 cases
relE 60 .24    call 12 .05    or 5 .02
and 51 .20    dot 8 .03    relGE 3 .01
relN 37 .15    relL 7 .03    in 3 .01
var 32 .13    relG 6 .02    relLE 1 .00
not 20 .08    dollar 6 .02








(58) and[*,@]

f = 251 H = 3.113, p = .001, H contr. = .003
17 cases
relE 85 .34    or 10 .04    caseexp 2 .01
relN 42 .17    dot 10 .04    index 2 .01
not 28 .11    in 8 .03    and 1 .00
call 22 .09    dollar 7 .03    relGE 1 .00
var 12 .05    relL 6 .02    fdollar 1 .00
relG 11 .04    relLE 3 .01








(59) ifexp[@]

f = 211 H = 2.315, p = .001, H contr. = .002
12 cases
relE 102 .48    relN 6 .03    dollar 4 .02
var 54 .26    relL 6 .03    or 3 .01
in 14 .07    dot 6 .03    not 3 .01
relG ? .04    and 4 .02    relGE 1 .00

(60) ifexp[*,@]

f = 211 H = 2.433, p = .001, H contr. = .002
18 cases (11 shown)
num 115 .55    dollar 9 .04    str ? .01
dot 25 .12    ifexp 6 .03    index ? .01
call 18 .09    plus 3 .01    memory ? .01
var 18 .09    minus 3 .01    <<others>> ? .03

(61) ifexp[*,*,@]

f = 211 H = 3.005, p = .001, H contr. = .002
21 cases
num 95 .45    plus 6 .03    uminus 2 .01
dot 20 .09    str 4 .02    addr 2 .01
var 20 .09    caseexp 3 .01    relN ? .00
call 14 .07    index 3 .01    relL ? .00
ifexp 13 .06    dindex 3 .01    times ? .00
dollar 9 .04    min 3 .01    inlinecall ? .00
minus 7 .03    not 2 .01    constructx ? .00

(62) fextract[@]

f = 196 H = 0.000, p = .001, H contr. = 0.000
list 196 1.00

(63) fextract[*,@]

f = 196 H = .672, p = .001, H contr. = .000
call 166 .85    inlinecall 28 .14    signal ? .01

(64) signal[@]

f = 186 H = .524, p = .001, H contr. = .000
var 164 .88    dot 22 .12

(65) signal[*,@]

f = 186 H = 1.602, p = .001, H contr. = .001
5 cases
<empty> 104 .56    var 31 .17    num ? .01
list 44 .24    dollar 5 .03

(66) signal[*,*,@]

f = 186 H = 0.000, p = .001, H contr. = 0.000
<empty> 186 1.00

(67) intCC[@]

f = 183 H = .483, p = .001, H contr. = .000
4 cases
num 168 .92    dot 2 .01
var 12 .07    minus 1 .01

(68) intCC[*,@]

f = 183 H = 1.466, p = .001, H contr. = .001
7 cases
num 133 .73    plus 7 .04    dollar ? .01
var 19 .10    div 7 .04
dot 10 .05    minus 6 .03

(69) construct[@]

f = 183 H = 1.817, p = .001, H contr. = .001
6 cases
var 90 .49    dot 26 .14    dollar ? .02
uparrow 52 .28    dindex 10 .05    index ? .01

(70) construct[*,@]

f = 183 H = 0.000, p = .001, H contr. = 0.000
list 183 1.00

(71) unionx[@]

f = 179 H = 0.000, p = .001, H contr. = 0.000
var 179 1.00

(72) unionx[*,@]

f = 179 H = 0.000, p = .001, H contr. = 0.000
list 179 1.00

(73) upthru[@]

f = 170 H = .767, p = .001, H contr. = .000
var 132 .78    <empty> 38 .22

(74) upthru[*,@]

f = 170 H = 1.220, p = .001, H contr. = .001
4 cases
intCC 83 .49    intOO ? .02
intCO 81 .48    intOC ? .02

(75) relG[@]

f = 159 H = 2.109, p = .001, H contr. = .001
11 cases
var 94 .59    <empty> 8 .05    times 1 .01
dot 21 .13    plus 7 .04    inlinecall 1 .01
dollar 10 .06    length 5 .03    abs 1 .01
assignx 9 .06    index 2 .01

(76) relG[*,@]

f = 159 H = 1.386, p = .001, H contr. = .001
9 cases
num 120 .75    dollar 6 .04    ifexp 1 .01
var 14 .09    index 3 .02    mod 1 .01
dot 11 .07    minus 2 .01    max 1 .01

(77) lbl[@]

f = 159 H = .984, p = .001, H contr. = .001
5 cases
1 127 .80    ? ? .03    ? ? .01
2 23 .14    ? ? .02

(78) catchphrase[@]

f = 130 H = 1.040, p = .000, H contr. = .000
item 98 .75    <empty> 20 .15    list 12 .09

(79) catchphrase[*,@]

f = 130 H = 1.024, p = .000, H contr. = .000
8 cases
<empty> 109 .84    continue 4 .03    assign ? .01
goto 8 .06    list 2 .02    error ? .01
inlinecall 4 .03    call 1 .01

(80) error[@]

f = 129 H = .065, p = .000, H contr. = .000
var 128 .99    dot 1 .01

(81) error[*,@]

f = 129 H = 1.903, p = .000, H contr. = .001
9 cases
<empty> 59 .46    num 6 .05    ifexp ? .01
var 47 .36    dot 3 .02    plus ? .01
list 9 .07    str 2 .02    addr ? .01

(82) error[*,*,@]

f = 129 H = .065, p = .000, H contr. = .000
<empty> 128 .99    catchphrase 1 .01

(83) or[@]

f = 127 H = 2.955, p = .000, H contr. = .001
11 cases
relE 38 .30    relG ? .06    dot 5 .04
relN 21 .17    var ? .06    call 2 .02
not 20 .16    and ? .06    assignx 1 .01
or 10 .08    relL ? .08

(84) or[*,@]

f = 127 H = 2.856, p = .000, H contr. = .001
14 cases
relE 43 .34    relG ? .04    dollar 2 .02
and 25 .20    dot ? .03    relGE 1 .01
relN 21 .17    call ? .02    relLE 1 .01
not 10 .08    caseexp ? .02    in 1 .01
var 7 .06    relL ? .02

(85) goto[@]

f = 105 H = 0.000, p = .000, H contr. = 0.000
lbl 105 1.00

(86) in[@]

f = 102 H = 1.492, p = .000, H contr. = .001
6 cases
var 57 .56    minus ? .05    plus ? .01
<empty> 35 .34    dollar ? .03    call ? .01

(87) in[*,@]

f = 102 H = .323, p = .000, H contr. = .000
intCC 96 .94    intCO 6 .06

(88) intCO[@]

f = 94 H = .849, p = .000, H contr. = .000
4 cases
num 78 .83    dot ? .04
var 11 .12    length ? .01

(89) intCO[*,@]

f = 94 H = 2.573, p = .000, H contr. = .001
9 cases
var 38 .40    plus 10 .11    min ? .03
length 15 .16    dollar 7 .07    div ? .02
dot 12 .13    minus 5 .05    call ? .02

(90) relL[@]

f = 93 H = 2.197, p = .000, H contr. = .001
9 cases
var 51 .55    assignx ? .10    call ? .02
dollar 10 .11    dot ? .06    index ? .02
<empty> 9 .10    plus ? .03    div ? .01

(91) relL[*,@]

f = 93 H = 2.141, p = .000, H contr. = .001
9 cases
num 46 .49    dollar ? .06    times 1 .01
var 20 .22    length ? .04    index 1 .01
dot 12 .13    div ? .02    min 1 .01

(92) openstmt[@]

f = 91 H = 0.000, p = .000, H contr. = 0.000
<empty> 91 1.00

(93) openstmt[*,@]

f = 91 H = 1.052, p = .000, H contr. = .000
8 cases
list 76 .84    casestmt 2 .02    ifstmt ? .01
enable 5 .05    inlinecall 1 .01    dostmt ? .01
label 4 .04    assign 1 .01

(94) div[@]

f = 87 H = 2.183, p = .000, H contr. = .001
7 cases
var 37 .43    dollar 6 .07    minus ? .01
plus 23 .26    call 6 .07
dot 11 .13    times 3 .03

(95) div[*,@]

f = 87 H = .572, p = .000, H contr. = .000
4 cases
num 79 .91    dot 3 .03
var 4 .05    times 1 .01

(96) register[@]

f = 87 H = 0.000, p = .000, H contr. = 0.000
num 87 1.00

(97) caseexp[@]

f = 84 H = 1.343, p = .000, H contr. = .000
7 cases
dollar 62 .74    call 2 .02    num 1 .01
var 12 .14    inlinecall 2 .02
dot 4 .05    dindex 1 .01

(98) caseexp[*,@]

f = 84 H = 0.000, p = .000, H contr. = 0.000
list 84 1.00

(99) caseexp[*,*,@]

f = 84 H = .885, p = .000, H contr. = .000
7 cases
num 73 .87    dot 2 .02    var 1 .01
call 3 .04    str 2 .02
ifexp 2 .02    dindex 1 .01

(100) forseq[@]

f = 84 H = 0.000, p = .000, H contr. = 0.000
var 84 1.00

(101) forseq[*,@]

f = 84 H = 2.496, p = .000, H contr. = .001
10 cases
var 30 .36    index 4 .05    minus 1 .01
dot 25 .30    num 4 .05    div 1 .01
call 10 .12    dindex 3 .04
dollar 4 .05    plus 2 .02

(102) forseq[*,*,@]

f = 84 H = 1.745, p = .000, H contr. = .000
5 cases
dot 49 .58    var 10 .12    minus 3 .04
call 14 .17    plus 8 .10

(103) seqindex[@]

f = 82 H = .592, p = .000, H contr. = .000
var 72 .88    dot 9 .11    dindex 1 .01

(104) seqindex[*,@]

f = 82 H = 1.736, p = .000, H contr. = .000
6 cases
var 52 .63    dot 7 .09    assignx 4 .05
minus 10 .12    num 7 .09    plus 2 .02

(105) caseswitch[@]

f = 81 H = 1.175, p = .000, H contr. = .000
minus 45 .56    <empty> 33 .41    plus 3 .04

(106) caseswitch[*,@]

f = 81 H = 0.000, p = .000, H contr. = 0.000
num 81 1.00

(107) caseswitch[*,*,@]

f = 81 H = 0.000, p = .000, H contr. = 0.000









(108) arraydesc[@]

f = 80 H = 0.000, p = .000, H contr. = 0.000
list 80 1.00

(109) constructx[@]

f = 77 H = 0.000, p = .000, H contr. = 0.000
temp 77 1.00

(110) constructx[*,@]

f = 77 H = 0.000, p = .000, H contr. = 0.000
list 77 1.00

(111) length[@]

f = 50 H = .529, p = .000, H contr. = .000
var 44 .88    dot 6 .12

(112) row[@]

f = 47 H = 0.000, p = .000, H contr. = 0.000
list 47 1.00

(113) rowcons[@]

f = 47 H = 0.000, p = .000, H contr. = 0.000
var 47 1.00

(114) rowcons[*,@]

f = 47 H = 0.000, p = .000, H contr. = 0.000
row 47 1.00

(115) mwconst[@]

f = 44 H = 0.000, p = .000, H contr. = 0.000
list 44 1.00

(116) resume[@]

f = 44 H = .774, p = .000, H contr. = .000
4 cases
<empty> 38 .86    dot ? .05
var 3 .07    list ? .02

(117) relGE[@]

f = 41 H = 1.778, p = .000, H contr. = .000
6 cases
var 23 .56    dot ? .07    <empty> ? .02
assignx 10 .24    dollar ? .07    plus ? .02

(118) relGE[*,@]

f = 41 H = 2.022, p = .000, H contr. = .000
7 cases
num 18 .44    dot 2 .05    max ? .02
var 13 .32    assignx 1 .02
length 5 .12    dollar 1 .02

(119) label[@]

f = 41 H = 1.883, p = .000, H contr. = .000
7 cases
list 25 .61    ifstmt 3 .07    assign ? .02
enable 5 .12    catchmark 2 .05
casestmt 4 .10    call 1 .02

(120) label[*,@]

f = 41 H = .281, p = .000, H contr. = .000
item 39 .95    list 2 .05

(121) uminus[@]

f = 35 H = 1.391, p = .000, H contr. = .000
5 cases
var 25 .71    call 3 .09    dindex ? .03
dollar 4 .11    dot 2 .06

(122) vconstruct[@]

f = 32 H = 0.000, p = .000, H contr. = 0.000
dollar 32 1.00
















(123) vconstruct[*,@]

f = 32 H = 0.000, p = .000, H contr. = 0.000
list 32 1.00

(124) relLE[@]

f = 30 H = 1.826, p = .000, H contr. = .000
8 cases
var 20 .67    abs 2 .07    dollar 1 .03
<empty> 2 .07    times 1 .03    index 1 .03
assignx 2 .07    dot 1 .03

(125) relLE[*,@]

f = 30 H = 1.505, p = .000, H contr. = .000
5 cases
num 20 .67    dot 2 .07    dollar 1 .03
var 5 .17    index 2 .07

(126) enable[@]

f = 29 H = 0.000, p = .000, H contr. = 0.000
catchphrase 29 1.00

(127) enable[*,@]

f = 29 H = .431, p = .000, H contr. = .000
list 27 .93    ifstmt 1 .03    dostmt 1 .03

(128) mod[@]

f = 27 H = 2.203, p = .000, H contr. = .000
6 cases
var 12 .44    assignx 3 .11    inlinecall 3 .11
plus 5 .19    dot 3 .11    dollar 1 .04

(129) mod[*,@]

f = 27 H = .979, p = .000, H contr. = .000
num 20 .74    length 6 .22    var 1 .04

(130) min[@]

f = 27 H = 0.000, p = .000, H contr. = 0.000
list 27 1.00

(131) stringinit[@]

f = 27 H = 0.000, p = .000, H contr. = 0.000
num 27 1.00

(132) max[@]

f = 25 H = 0.000, p = .000, H contr. = 0.000
list 25 1.00

(133) catchmark[@]

f = 24 H = 1.908, p = .000, H contr. = .000
5 cases
call 10 .42    enable 6 .25    ifstmt 1 .04
assign 6 .25    fextract 1 .04

(134) base[@]

f = 22 H = .439, p = .000, H contr. = .000
var 20 .91    dot 2 .09

(135) memory[@]

f = 13 H = 1.884, p = .000, H contr. = .000
4 cases
var 5 .38    plus 2 .15
num 4 .31    minus 2 .15

(136) fdollar[@]

f = 11 H = .845, p = .000, H contr. = .000
call 8 .73    inlinecall 3 .27

(137) fdollar[*,@]

f = 11 H = 0.000, p = .000, H contr. = 0.000
var 11 1.00

(138) downthru[@]

f = 11 H = 0.000, p = .000, H contr. = 0.000
var 11 1.00

(139) downthru[*,@]

f = 11 H = .946, p = .000, H contr. = .000
intCO 7 .64    intCC 4 .36

(140) dst[@]

f = 10 H = 0.000, p = .000, H contr. = 0.000
var 10 1.00

(141) lstf[@]

f = 10 H = 0.000, p = .000, H contr. = 0.000
var 10 1.00

(142) extract[@]

f = 9 H = 0.000, p = .000, H contr. = 0.000
list 9 1.00

(143) extract[*,@]

f = 9 H = 1.224, p = .000, H contr. = .000
call 6 .67    uparrow 2 .22    dollar 1 .11

(144) lst[@]

f = 9 H = 0.000, p = .000, H contr. = 0.000
var 9 1.00

(145) abs[@]

f = 8 H = 1.299, p = .000, H contr. = .000
var 5 .62    call 2 .25    dot 1 .12

(146) start[@]

f = 8 H = .811, p = .000, H contr. = .000
var 6 .75    dot 2 .25

(147) start[*,@]

f = 8 H = 0.000, p = .000, H contr. = 0.000
<empty> 8 1.00

(148) start[*,*,@]

f = 8 H = .544, p = .000, H contr. = .000
<empty> 7 .87    catchphrase 1 .12

(149) intOO[@]

f = 3 H = 0.000, p = .000, H contr. = 0.000
var 3 1.00

(150) intOO[*,@]

f = 3 H = .918, p = .000, H contr. = .000
var 2 .67    plus 1 .33

(151) intOC[@]

f = 3 H = 0.000, p = .000, H contr. = 0.000
var 3 1.00

(152) intOC[*,@]

f = 3 H = 0.000, p = .000, H contr. = 0.000
var 3 1.00

(153) stop[@]

f = 3 H = 0.000, p = .000, H contr. = 0.000
<empty> 3 1.00

(154) svc[@]

f = 2 H = 0.000, p = .000, H contr. = 0.000
num 2 1.00
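
The H value printed for each pattern appears to be the Shannon entropy, in bits, of that pattern's distribution over cases. The following is a minimal Python sketch, not part of the original analysis, that recomputes H from the counts of entries that list every case; the function name `entropy_bits` is illustrative only. Entries that lump cases under <<others>> cannot be checked this way, because the individual counts are not shown.

```python
import math

def entropy_bits(counts):
    """Shannon entropy, in bits, of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Counts copied from entries above that list every case.
downthru_second = [7, 4]   # (139) downthru[*,@]: intCO 7, intCC 4
intoo_second = [2, 1]      # (150) intOO[*,@]:    var 2, plus 1

print(round(entropy_bits(downthru_second), 3))  # 0.946, matching H = .946
print(round(entropy_bits(intoo_second), 3))     # 0.918, matching H = .918
```

A pattern with a single case, such as (140) dst[@], has zero entropy under this definition, which agrees with the H = 0.000 entries.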






Appendix C: Detailed Pattern Data 



C.2. Initial Pattern Set -- Sorted by Entropy Contribution 



pattern  freq.      H      H  cumul.  pattern
number                contr.      H

  2.  40289  3.680   .499   .499  var[*,@]
  5.  22481  3.763   .285   .784  list[*,...@]
  4.  40289  2.099   .285  1.069  var[*,*,*,@]
  1.  40289  1.762   .239  1.308  var[@]
  6.  10220  5.132   .177  1.485  num[@]
 12.   5826  3.591   .070  1.556  assign[*,@]
  9.   6616  2.961   .066  1.622  call[*,@]
  7.   7425  2.476   .062  1.683  list[@]
  3.  40289   .336   .046  1.729  var[*,*,@]
 11.   5826  1.413   .028  1.757  assign[@]
 13.   5726  1.400   .027  1.784  dot[@]
 17.   2770  2.051   .019  1.803  dollar[@]
 22.   1810  3.059   .019  1.822  ifstmt[*,@]
 21.   1810  3.047   .019  1.840  ifstmt[@]
 16.   4039  1.279   .017  1.858  plus[*,@]
 19.   2348  2.010   .016  1.873  relE[@]
 38.    511  8.735   .015  1.888  str[@]
 24.   1683  2.521   .014  1.903  return[@]
 15.   4039  1.037   .014  1.917  plus[@]
 29.   1214  3.315   .014  1.930  item[*,@]
 23.   1810  1.329   .008  1.939  ifstmt[*,*,@]
 20.   2348   .819   .006  1.945  relE[*,@]
 32.    639  2.581   .006  1.951  casestmt[*,*,@]
  8.   6616   .238   .005  1.956  call[@]
 27.   1270  1.191   .005  1.961  uparrow[@]
 34.    572  2.612   .005  1.966  casetest[*,@]
 36.    538  2.411   .004  1.970  relN[@]
 44.    469  2.672   .004  1.975  minus[@]
 50.    348  3.349   .004  1.979  assignx[*,@]
 28.   1214   .936   .004  1.982  item[@]
 41.    484  2.224   .004  1.986  dostmt[*,*,@]
 47.    433  2.450   .004  1.990  dindex[*,@]
 35.    570  1.824   .004  1.993  addr[@]
 48.    382  2.636   .003  1.996  not[@]
 30.    639  1.543   .003  2.000  casestmt[@]
 40.    484  1.777   .003  2.003  dostmt[*,@]
 55.    262  3.019   .003  2.005  inlinecall[@]
 52.    338  2.325   .003  2.008  index[*,@]
 58.    251  3.113   .003  2.011  and[*,@]
 39.    484  1.610   .003  2.013  dostmt[@]
 57.    251  3.104   .003  2.016  and[@]
 45.    469  1.590   .003  2.018  minus[*,@]
 10.   6616   .112   .002  2.021  call[*,*,@]
 61.    211  3.005   .002  2.023  ifexp[*,*,@]
 37.    538  1.174   .002  2.025  relN[*,@]
 53.    317  1.702   .002  2.027  times[@]
 60.    211  2.433   .002  2.029  ifexp[*,@]
 59.    211  2.315   .002  2.030  ifexp[@]
 49.    348  1.308   .002  2.032  assignx[@]
 83.    127  2.955   .001  2.033  or[@]
 84.    127  2.856   .001  2.034  or[*,@]
 33.    572   .597   .001  2.035  casetest[@]
 75.    159  2.109   .001  2.037  relG[@]
 69.    183  1.817   .001  2.038  construct[@]
 51.    338   .965   .001  2.039  index[@]
 65.    186  1.602   .001  2.040  signal[*,@]
 68.    183  1.466   .001  2.041  intCC[*,@]
 81.    129  1.903   .001  2.042  error[*,@]
 89.     94  2.573   .001  2.042  intCO[*,@]
 56.    262   .843   .001  2.043  inlinecall[*,@]
 76.    159  1.386   .001  2.044  relG[*,@]
101.     84  2.496   .001  2.045  forseq[*,@]
 74.    170  1.220   .001  2.045  upthru[*,@]
 90.     93  2.197   .001  2.046  relL[@]
 91.     93  2.141   .001  2.047  relL[*,@]
 94.     87  2.183   .001  2.047  div[@]
 54.    317   .535   .001  2.048  times[*,@]
 77.    159   .984   .001  2.048  lbl[@]
 86.    102  1.492   .001  2.049  in[@]
102.     84  1.745   .000  2.049  forseq[*,*,@]
104.     82  1.736   .000  2.050  seqindex[*,@]
 43.    484   .292   .000  2.050  dostmt[*,*,*,*,@]
 78.    130  1.040   .000  2.051  catchphrase[@]
 79.    130  1.024   .000  2.051  catchphrase[*,@]
 63.    196   .672   .000  2.052  fextract[*,@]
 73.    170   .767   .000  2.052  upthru[@]
 97.     84  1.343   .000  2.053  caseexp[@]
 64.    186   .524   .000  2.053  signal[@]
 93.     91  1.052   .000  2.053  openstmt[*,@]
105.     81  1.175   .000  2.053  caseswitch[@]
 67.    183   .483   .000  2.054  intCC[@]
 46.    433   .195   .000  2.054  dindex[@]
118.     41  2.022   .000  2.054  relGE[*,@]
 88.     94   .849   .000  2.055  intCO[@]
119.     41  1.883   .000  2.055  label[@]
 99.     84   .885   .000  2.055  caseexp[*,*,@]
117.     41  1.778   .000  2.055  relGE[@]
 42.    484   .133   .000  2.056  dostmt[*,*,*,@]
128.     27  2.203   .000  2.056  mod[@]
124.     30  1.826   .000  2.056  relLE[@]
 95.     87   .572   .000  2.056  div[*,@]
121.     35  1.391   .000  2.056  uminus[@]
103.     82   .592   .000  2.056  seqindex[@]
133.     24  1.908   .000  2.057  catchmark[@]
125.     30  1.505   .000  2.057  relLE[*,@]
116.     44   .774   .000  2.057  resume[@]
 87.    102   .323   .000  2.057  in[*,@]
111.     50   .529   .000  2.057  length[@]
129.     27   .979   .000  2.057  mod[*,@]
135.     13  1.884   .000  2.057  memory[@]
127.     29   .431   .000  2.057  enable[*,@]
120.     41   .281   .000  2.057  label[*,@]
143.      9  1.224   .000  2.057  extract[*,@]
139.     11   .946   .000  2.057  downthru[*,@]
145.      8  1.299   .000  2.057  abs[@]
134.     22   .439   .000  2.057  base[@]
136.     11   .845   .000  2.058  fdollar[@]
 80.    129   .065   .000  2.058  error[@]
 82.    129   .065   .000  2.058  error[*,*,@]
146.      8   .811   .000  2.058  start[@]
148.      8   .544   .000  2.058  start[*,*,@]
150.      3   .918   .000  2.058  intOO[*,@]
 14.   5726  0.000  0.000  2.058  dot[*,@]
 18.   2770  0.000  0.000  2.058  dollar[*,@]
 25.   1513  0.000  0.000  2.058  @
 26.   1513  0.000  0.000  2.058  body[@]
 31.    639  0.000  0.000  2.058  casestmt[*,@]
 62.    196  0.000  0.000  2.058  fextract[@]
 66.    186  0.000  0.000  2.058  signal[*,*,@]
 70.    183  0.000  0.000  2.058  construct[*,@]
 71.    179  0.000  0.000  2.058  unionx[@]
 72.    179  0.000  0.000  2.058  unionx[*,@]
 85.    105  0.000  0.000  2.058  goto[@]
 92.     91  0.000  0.000  2.058  openstmt[@]
 96.     87  0.000  0.000  2.058  register[@]
 98.     84  0.000  0.000  2.058  caseexp[*,@]
100.     84  0.000  0.000  2.058  forseq[@]
106.     81  0.000  0.000  2.058  caseswitch[*,@]



107.     81  0.000  0.000  2.058  caseswitch[*,*,@]
108.     80  0.000  0.000  2.058  arraydesc[@]
109.     77  0.000  0.000  2.058  constructx[@]
110.     77  0.000  0.000  2.058  constructx[*,@]
112.     47  0.000  0.000  2.058  row[@]
113.     47  0.000  0.000  2.058  rowcons[@]
114.     47  0.000  0.000  2.058  rowcons[*,@]
115.     44  0.000  0.000  2.058  mwconst[@]
122.     32  0.000  0.000  2.058  vconstruct[@]
123.     32  0.000  0.000  2.058  vconstruct[*,@]
126.     29  0.000  0.000  2.058  enable[@]
130.     27  0.000  0.000  2.058  min[@]
131.     27  0.000  0.000  2.058  stringinit[@]
132.     25  0.000  0.000  2.058  max[@]
137.     11  0.000  0.000  2.058  fdollar[*,@]
138.     11  0.000  0.000  2.058  downthru[@]
140.     10  0.000  0.000  2.058  dst[@]
141.     10  0.000  0.000  2.058  lstf[@]
142.      9  0.000  0.000  2.058  extract[@]
144.      9  0.000  0.000  2.058  lst[@]
147.      8  0.000  0.000  2.058  start[*,@]
149.      3  0.000  0.000  2.058  intOO[@]
151.      3  0.000  0.000  2.058  intOC[@]
152.      3  0.000  0.000  2.058  intOC[*,@]
153.      3  0.000  0.000  2.058  stop[@]
154.      2  0.000  0.000  2.058  svc[@]
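
The "H contr." and "cumul. H" columns appear to be p × H and its running sum, where p is the pattern's frequency divided by the total number of nodes in the sample. The following Python sketch illustrates that reading. It is not the original program: the function name is illustrative, and the total of 297,000 nodes is an assumption back-computed from the first rows of the table, since the appendix does not state it.

```python
def contribution_table(rows, total_nodes):
    """Return (pattern, freq, H, contribution, cumulative) tuples,
    largest entropy contribution first.

    rows: iterable of (pattern, freq, H) triples.
    total_nodes: ASSUMED sample size; not stated in the appendix.
    """
    scored = sorted(
        ((name, freq, h, freq / total_nodes * h) for name, freq, h in rows),
        key=lambda r: r[3],
        reverse=True,
    )
    table, running = [], 0.0
    for name, freq, h, contr in scored:
        running += contr  # running sum gives the "cumul. H" column
        table.append((name, freq, h, contr, running))
    return table

ASSUMED_TOTAL_NODES = 297_000  # hypothetical value estimated from the table

rows = [
    ("var[*,*,*,@]", 40289, 2.099),
    ("var[*,@]", 40289, 3.680),
    ("list[*,...@]", 22481, 3.763),
]
for name, freq, h, contr, cumul in contribution_table(rows, ASSUMED_TOTAL_NODES):
    print(f"{name:14s} {freq:6d} {h:6.3f} {contr:6.3f} {cumul:6.3f}")
```

Under this assumption the three rows come out in the same order as the first three lines of the table, with contributions of about .499, .285, and .285.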

C.3. Initial Pattern Set -- Alphabetical by Pattern

pattern  freq.      H      H  pattern
number                contr.

 25.   1513  0.000  0.000  @
145.      8  1.299   .000  abs[@]
 35.    570  1.824   .004  addr[@]
 57.    251  3.104   .003  and[@]
 58.    251  3.113   .003  and[*,@]
108.     80  0.000  0.000  arraydesc[@]
 11.   5826  1.413   .028  assign[@]
 12.   5826  3.591   .070  assign[*,@]
 49.    348  1.308   .002  assignx[@]
 50.    348  3.349   .004  assignx[*,@]
134.     22   .439   .000  base[@]
 26.   1513  0.000  0.000  body[@]
  8.   6616   .238   .005  call[@]
  9.   6616  2.961   .066  call[*,@]
 10.   6616   .112   .002  call[*,*,@]
 97.     84  1.343   .000  caseexp[@]
 98.     84  0.000  0.000  caseexp[*,@]
 99.     84   .885   .000  caseexp[*,*,@]
 30.    639  1.543   .003  casestmt[@]
 31.    639  0.000  0.000  casestmt[*,@]
 32.    639  2.581   .006  casestmt[*,*,@]
105.     81  1.175   .000  caseswitch[@]
106.     81  0.000  0.000  caseswitch[*,@]
107.     81  0.000  0.000  caseswitch[*,*,@]
 33.    572   .597   .001  casetest[@]
 34.    572  2.612   .005  casetest[*,@]
133.     24  1.908   .000  catchmark[@]
 78.    130  1.040   .000  catchphrase[@]
 79.    130  1.024   .000  catchphrase[*,@]
 69.    183  1.817   .001  construct[@]
 70.    183  0.000  0.000  construct[*,@]
109.     77  0.000  0.000  constructx[@]
110.     77  0.000  0.000  constructx[*,@]
 46.    433   .195   .000  dindex[@]
 47.    433  2.450   .004  dindex[*,@]
 94.     87  2.183   .001  div[@]
 95.     87   .572   .000  div[*,@]
 17.   2770  2.051   .019  dollar[@]
 18.   2770  0.000  0.000  dollar[*,@]
 39.    484  1.610   .003  dostmt[@]
 40.    484  1.777   .003  dostmt[*,@]
 41.    484  2.224   .004  dostmt[*,*,@]
 42.    484   .133   .000  dostmt[*,*,*,@]
 43.    484   .292   .000  dostmt[*,*,*,*,@]
 13.   5726  1.400   .027  dot[@]
 14.   5726  0.000  0.000  dot[*,@]
138.     11  0.000  0.000  downthru[@]
139.     11   .946   .000  downthru[*,@]
140.     10  0.000  0.000  dst[@]
126.     29  0.000  0.000  enable[@]
127.     29   .431   .000  enable[*,@]
 80.    129   .065   .000  error[@]
 81.    129  1.903   .001  error[*,@]
 82.    129   .065   .000  error[*,*,@]
142.      9  0.000  0.000  extract[@]
143.      9  1.224   .000  extract[*,@]
136.     11   .845   .000  fdollar[@]
137.     11  0.000  0.000  fdollar[*,@]
 62.    196  0.000  0.000  fextract[@]
 63.    196   .672   .000  fextract[*,@]
100.     84  0.000  0.000  forseq[@]
101.     84  2.496   .001  forseq[*,@]
102.     84  1.745   .000  forseq[*,*,@]
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C.3. Initial Pattern Set ~ Alphabetical by Pattern 



pattern  freq.      H      H  pattern
 number                contr.
    85.    105  0.000  0.000  goto[@]
    59.    211  2.315   .002  ifexp[@]
    60.    211  2.433   .002  ifexp[*,@]
    61.    211  3.006   .002  ifexp[*,*,@]
    21.   1810  3.047   .019  ifstmt[@]
    22.   1810  3.059   .019  ifstmt[*,@]
    23.   1810  1.329   .008  ifstmt[*,*,@]
    86.    102  1.492   .001  in[@]
    87.    102   .323   .000  in[*,@]
    51.    338   .965   .001  index[@]
    52.    338  2.325   .003  index[*,@]
    55.    262  3.019   .003  inlinecall[@]
    56.    262   .843   .001  inlinecall[*,@]
    67.    183   .483   .000  intCC[@]
    68.    183  1.466   .001  intCC[*,@]
    88.     94   .849   .000  intCO[@]
    89.     94  2.573   .001  intCO[*,@]
   151.      3  0.000  0.000  intOC[@]
   152.      3  0.000  0.000  intOC[*,@]
   149.      3  0.000  0.000  intOO[@]
   150.      3   .918   .000  intOO[*,@]
    28.   1214   .936   .004  item[@]
    29.   1214  3.315   .014  item[*,@]
   119.     41  1.883   .000  label[@]
   120.     41   .281   .000  label[*,@]
    77.    159   .984   .001  lbl[@]
   111.     50   .529   .000  length[@]
     7.   7425  2.476   .062  list[@]
     5.  22481  3.763   .285  list[*,...@]
   144.      9  0.000  0.000  lst[@]
   141.     10  0.000  0.000  lstf[@]
   132.     25  0.000  0.000  max[@]
   135.     13  1.884   .000  memory[@]
   130.     27  0.000  0.000  min[@]
    44.    469  2.672   .004  minus[@]
    45.    469  1.590   .003  minus[*,@]
   128.     27  2.203   .000  mod[@]
   129.     27   .979   .000  mod[*,@]
   115.     44  0.000  0.000  mwconst[@]
    48.    382  2.636   .003  not[@]
     6.  10220  5.132   .177  num[@]
    92.     91  0.000  0.000  openstmt[@]
    93.     91  1.052   .000  openstmt[*,@]
    83.    127  2.955   .001  or[@]
    84.    127  2.856   .001  or[*,@]
    15.   4039  1.037   .014  plus[@]
    16.   4039  1.279   .017  plus[*,@]
    96.     87  0.000  0.000  register[@]
    19.   2348  2.010   .016  relE[@]
    20.   2348   .819   .006  relE[*,@]
   117.     41  1.778   .000  relGE[@]
   118.     41  2.022   .000  relGE[*,@]
    75.    159  2.109   .001  relG[@]
    76.    159  1.386   .001  relG[*,@]
   124.     30  1.826   .000  relLE[@]
   125.     30  1.505   .000  relLE[*,@]
    90.     93  2.197   .001  relL[@]
    91.     93  2.141   .001  relL[*,@]
    36.    538  2.411   .004  relN[@]
    37.    538  1.174   .002  relN[*,@]
   116.     44   .774   .000  resume[@]
    24.   1683  2.521   .014  return[@]
   112.     47  0.000  0.000  row[@]
   113.     47  0.000  0.000  rowcons[@]
   114.     47  0.000  0.000  rowcons[*,@]

Appendix C: Detailed Pattern Data



pattern  freq.      H      H  pattern
 number                contr.
   103.     82   .592   .000  seqindex[@]
   104.     82  1.736   .000  seqindex[*,@]
    64.    186   .524   .000  signal[@]
    65.    186  1.602   .001  signal[*,@]
    66.    186  0.000  0.000  signal[*,*,@]
   146.      8   .811   .000  start[@]
   147.      8  0.000  0.000  start[*,@]
   148.      8   .544   .000  start[*,*,@]
   153.      3  0.000  0.000  stop[@]
    38.    511  8.735   .015  str[@]
   131.     27  0.000  0.000  stringinit[@]
   154.      2  0.000  0.000  svc[@]
    53.    317  1.702   .002  times[@]
    54.    317   .535   .001  times[*,@]
   121.     35  1.391   .000  uminus[@]
    71.    179  0.000  0.000  unionx[@]
    72.    179  0.000  0.000  unionx[*,@]
    27.   1270  1.191   .005  uparrow[@]
    73.    170   .767   .000  upthru[@]
    74.    170  1.220   .001  upthru[*,@]
     1.  40289  1.762   .239  var[@]
     2.  40289  3.680   .499  var[*,@]
     3.  40289   .336   .046  var[*,*,@]
     4.  40289  2.099   .285  var[*,*,*,@]
   122.     32  0.000  0.000  vconstruct[@]
   123.     32  0.000  0.000  vconstruct[*,@]



C.4. Final Pattern Set — Sorted by Entropy Contribution 

In the refined pattern set, some patterns are defined in terms of the larger
patterns "removed" from them: they match only the nodes that remain after the
larger patterns have been split off. These patterns are denoted by an asterisk
(*) in the "was refined" column. See the listing sorted by pattern number to
find the set of patterns refining a given pattern.
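
The relationship between the columns of the listing can be sketched as follows.
This is a minimal illustration, assuming that "H contr." is a pattern's entropy
H weighted by the fraction of all nodes matching the pattern, and that
"cumul. H" is the running sum of "H contr." down the listing. The function
names and the total_nodes parameter are hypothetical; the total node count is
not stated in this section.

```python
from itertools import accumulate


def entropy_contribution(freq, h, total_nodes):
    """Assumed weighting: share of all nodes matching the pattern, times H."""
    return freq / total_nodes * h


def cumulative_entropy(contributions):
    """Running total of contributions, as in the "cumul. H" column."""
    return list(accumulate(contributions))


# "H contr." values of the three largest contributors in the listing.
top_contributions = [0.096, 0.069, 0.066]
print([round(x, 3) for x in cumulative_entropy(top_contributions)])

# Hypothetical figures, not taken from the tables:
# 100 matching nodes out of 1000, at 2.0 bits each.
print(entropy_contribution(100, 2.0, 1000))
```

Summing the rounded contributions gives .165 for the second row, where the
listing shows .164; the listing was presumably accumulated from unrounded
values.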



pattern     was  freq.      H      H cumul.  pattern
 number refined               contr.      H
     6.           4850  5.828   .096   .096  num[@]
    12.           5663  3.574   .069   .164  assign[*,@]
     9.           6616  2.961   .066   .231  call[*,@]
   155.           5332  3.411   .062   .292  var[global,@]
   245.           6617  2.639   .059   .352  body[list[*,...@]]
   214.           4016  4.241   .058   .409  call[var[global,@]]
   243.           4570  3.441   .053   .463  dot[*,var[field,@]]
   160.           6685  2.327   .053   .515  var[local,*,*,@]
   159.          12217  1.114   .046   .561  var[global,*,*,@]
   194.           2610  4.744   .042   .603  {relL|relLE|relE|relN|relGE|relG}[*,num[@]]
     1.          10800  1.141   .042   .645  var[@]
   267.           4570  2.548   .039   .685  dot[*,var[field,*,*,@]]
   158.           2686  4.039   .037   .721  var[entry,@]
   246.           4263  2.425   .035   .756  call[*,list[*,...@]]
   196.           3296  2.995   .033   .790  {assign|assignx}[var[local,@]]
     5.           3096  3.130   .033   .823  list[*,...@]
     7.           2874  3.163   .031   .853  list[@]
   268.           2709  3.287   .030   .884  dollar[*,var[field,*,*,@]]
    11.           5663  1.423   .027   .911  assign[@]
    13.           5697  1.401   .027   .938  dot[@]
   156.           2335  3.153   .025   .963  var[local,@]
   242.           2709  2.433   .022   .985  dollar[*,var[field,@]]
   229.           3114  2.078   .022  1.007  assign[var[local,*,*,@]]
   171.           6370   .992   .021  1.029  call[var[@]]
    17.           2754  2.050   .019  1.048  dollar[@]
    22.           1810  3.093   .019  1.067  ifstmt[*,@]
    21.           1810  3.047   .019  1.085  ifstmt[@]
   205.           3027  1.795   .018  1.104  plus[*,var[local,@]]
   201.           1909  2.798   .018  1.122  list[*,...var[local,@]]
   174.           5697   .855   .017  1.138  dot[*,var[@]]
    19.           2348  2.023   .016  1.155  relE[@]
   164.           2754  1.713   .016  1.171  dollar[*,var[*,*,@]]
   216.            944  4.912   .016  1.186  dot[*,var[global,@]]
    16.           3847  1.180   .015  1.202  plus[*,@]
    38.            511  8.735   .015  1.217  str[@]
    24.           1683  2.521   .014  1.231  return[@]
    29.           1214  3.315   .014  1.245  item[*,@]
   237.           1909  2.104   .014  1.258  list[*,...var[local,*,*,@]]
    15.           3847  1.032   .013  1.272  plus[@]
   163.           5697   .657   .013  1.285  dot[*,var[*,*,@]]
   213.            752  4.743   .012  1.297  assign[var[global,@]]
   259.           1355  2.625   .012  1.309  item[*,list[*,...@]]
   186.           1078  3.056   .011  1.320  assign[*,num[@]]
   198.           1445  2.275   .011  1.331  call[*,var[local,@]]
   167.           3885   .752   .010  1.341  assign[var[@]]
   173.           2941   .939   .009  1.350  dot[var[@]]
   199.           1929  1.384   .009  1.359  dot[var[local,@]]
    23.           1810  1.333   .008  1.367  ifstmt[*,*,@]
   215.           1009  2.251   .008  1.375  dot[var[global,@]]
   209.            853  2.607   .008  1.383  {relL|relLE|relE|relN|relGE|relG}[var[local,@]]
   232.           1445  1.463   .007  1.390  call[*,var[local,*,*,@]]
   255.            828  2.427   .007  1.397  dostmt[*,*,list[*,...@]]
   254.            670  2.948   .007  1.403  {construct|constructx}[*,list[*,...@]]
   219.           1816  1.085   .007  1.410  call[*,list[@]]
    20.           2348   .819   .007  1.417  relE[*,@]



pattern     was  freq.      H      H cumul.  pattern
 number refined               contr.      H
   233.            840  2.252   .006  1.423  dollar[var[local,*,*,@]]
   197.            610  2.949   .006  1.429  {assign|assignx}[*,var[local,@]]
    32.            639  2.590   .006  1.435  casestmt[*,*,@]
     8.           6616   .238   .005  1.440  call[@]
    34.            572  2.612   .005  1.445  casetest[*,@]
    27.           1260  1.184   .005  1.450  uparrow[@]
   257.            489  3.016   .005  1.455  inlinecall[*,list[*,...@]]
   172.           1739   .806   .005  1.460  call[*,var[@]]
    36.            538  2.411   .004  1.464  relN[@]
   223.            598  2.132   .004  1.469  ifstmt[*,list[@]]
    44.            469  2.672   .004  1.473  minus[@]
   230.            590  2.026   .004  1.477  assign[*,var[local,*,*,@]]
    28.           1214   .936   .004  1.481  item[@]
   252.            600  1.863   .004  1.485  casetest[*,list[*,...@]]
    41.            484  2.269   .004  1.488  dostmt[*,*,@]
    50.            323  3.313   .004  1.492  assignx[*,@]
    47.            433  2.450   .004  1.495  dindex[*,@]
    35.            570  1.824   .004  1.499  addr[@]
    48.            382  2.636   .003  1.502  not[@]
   184.            757  1.317   .003  1.506  {index|dindex|seqindex}[var[@]]
    30.            639  1.543   .003  1.509  casestmt[@]
   207.            264  3.453   .003  1.512  {index|dindex|seqindex}[var[local,@]]
   227.            425  2.096   .003  1.515  item[*,list[@]]
   191.            316  2.807   .003  1.518  plus[*,num[@]]
   204.            321  2.684   .003  1.521  plus[var[local,@]]
    40.            484  1.777   .003  1.524  dostmt[*,@]
   262.            391  2.113   .003  1.527  openstmt[*,list[*,...@]]
    55.            262  3.019   .003  1.530  inlinecall[@]
    52.            338  2.325   .003  1.532  index[*,@]
    58.            251  3.113   .003  1.535  and[*,@]
    39.            484  1.610   .003  1.537  dostmt[@]
    57.            251  3.104   .003  1.540  and[@]
   190.            318  2.446   .003  1.543  minus[*,num[@]]
   220.            639  1.208   .003  1.545  casestmt[*,list[@]]
   161.            428  1.749   .003  1.548  var[field,*,*,@]
    45.            469  1.590   .003  1.550  minus[*,@]
   168.            775   .959   .003  1.553  assign[*,var[@]]
    10.           6616   .112   .003  1.555  call[*,*,@]
   192.            292  2.387   .002  1.558  times[*,num[@]]
   175.            964   .682   .002  1.560  dollar[var[@]]
   217.            164  3.938   .002  1.562  signal[var[global,@]]
   157.            428  1.491   .002  1.564  var[field,@]
    61.            211  3.005   .002  1.567  ifexp[*,*,@]
    37.            538  1.174   .002  1.569  relN[*,@]
   264.            686   .882   .002  1.571  row[list[*,...@]]
   208.            245  2.364   .002  1.573  {index|dindex|seqindex}[*,var[local,@]]
   210.            191  3.006   .002  1.575  {relL|relLE|relE|relN|relGE|relG}[*,var[local
   221.            253  2.208   .002  1.577  dostmt[*,*,list[@]]
   202.            199  2.803   .002  1.578  minus[var[local,@]]
    53.            317  1.702   .002  1.580  times[@]
   179.            955   .549   .002  1.582  {relE|relG|relN|relL|relGE|relLE}[var[@]]
    60.            211  2.433   .002  1.584  ifexp[*,@]
   165.            179  2.818   .002  1.586  unionx[var[*,*,@]]
    59.            211  2.315   .002  1.587  ifexp[@]
    49.            323  1.353   .001  1.589  assignx[@]
   260.            187  2.311   .001  1.590  label[list[*,...@]]
   188.            168  2.569   .001  1.592  intCC[num[@]]
   206.            162  2.598   .001  1.593  times[var[local,@]]
   244.            160  2.536   .001  1.594  arraydesc[list[*,...@]]
   180.            403   .998   .001  1.596  {relE|relG|relN|relL|relGE|relLE}[*,var[@]]
   211.            355  1.130   .001  1.597  uparrow[var[local,@]]
    83.            127  2.955   .001  1.598  or[@]
    84.            127  2.856   .001  1.600  or[*,@]
   231.            182  1.963   .001  1.601  assignx[var[local,*,*,@]]
   248.            982   .363   .001  1.602  casestmt[*,list[*,...@]]



pattern     was  freq.      H      H cumul.  pattern
 number refined               contr.      H
   224.            173  2.051   .001  1.603  ifstmt[*,*,list[@]]
   249.            159  2.228   .001  1.604  casestmt[*,*,list[*,...@]]
   176.           2754   .126   .001  1.606  dollar[*,var[@]]
    33.            572   .597   .001  1.607  casetest[@]
    75.            159  2.138   .001  1.608  relG[@]
    69.            183  1.817   .001  1.609  construct[@]
   195.            148  2.227   .001  1.610  return[num[@]]
    51.            338   .965   .001  1.611  index[@]
   166.            339   .958   .001  1.612  addr[var[@]]
    65.            186  1.602   .001  1.613  signal[*,@]
    68.            183  1.466   .001  1.614  intCC[*,@]
   263.            160  1.636   .001  1.615  return[list[*,...@]]
   212.            129  2.023   .001  1.616  upthru[var[local,@]]
    81.            129  1.903   .001  1.617  error[*,@]
    89.             94  2.573   .001  1.618  intCO[*,@]
   256.            354   .663   .001  1.618  fextract[list[*,...@]]
   203.             90  2.595   .001  1.619  minus[*,var[local,@]]
   181.            337   .692   .001  1.620  return[var[@]]
    56.            262   .843   .001  1.621  inlinecall[*,@]
    76.            159  1.386   .001  1.622  relG[*,@]
   222.            196  1.099   .001  1.622  fextract[list[@]]
   101.             84  2.496   .001  1.623  forseq[*,@]
    74.            170  1.220   .001  1.624  upthru[*,@]
    90.             93  2.197   .001  1.624  relL[@]
    91.             93  2.141   .001  1.625  relL[*,@]
   193.             87  2.221   .001  1.626  register[num[@]]
    94.             87  2.183   .001  1.626  div[@]
   241.            129  1.400   .001  1.627  upthru[var[local,*,*,@]]
   185.            289   .615   .001  1.628  {index|dindex|seqindex}[*,var[@]]
    54.            317   .535   .001  1.628  times[*,@]
   269.            163  1.027   .001  1.629  bump[@]
    77.            159   .984   .001  1.629  lbl[@]
   169.            223   .688   .001  1.630  assignx[var[@]]
    86.            102  1.492   .001  1.630  in[@]
   265.             88  1.680   .001  1.631  signal[*,list[*,...@]]
   102.             84  1.745   .000  1.631  forseq[*,*,@]
   104.             82  1.736   .000  1.632  seqindex[*,@]
    43.            484   .292   .000  1.632  dostmt[*,*,*,*,@]
   187.             85  1.606   .000  1.633  index[*,num[@]]
    78.            130  1.040   .000  1.633  catchphrase[@]
    79.            130  1.024   .000  1.634  catchphrase[*,@]
    63.            196   .672   .000  1.634  fextract[*,@]
    73.            170   .767   .000  1.635  upthru[@]
   225.            230   .549   .000  1.635  inlinecall[*,list[@]]
   226.             97  1.190   .000  1.635  item[list[@]]
    97.             84  1.343   .000  1.636  caseexp[@]
   258.            243   .422   .000  1.636  item[list[*,...@]]
   182.            372   .268   .000  1.636  uparrow[var[@]]
    64.            186   .524   .000  1.637  signal[@]
    93.             91  1.052   .000  1.637  openstmt[*,@]
   105.             81  1.175   .000  1.637  caseswitch[@]
    67.            183   .483   .000  1.638  intCC[@]
    46.            433   .195   .000  1.638  dindex[@]
   118.             41  2.022   .000  1.638  relGE[*,@]
   200.             84   .963   .000  1.639  forseq[var[local,@]]
   117.             41  1.954   .000  1.639  relGE[@]
    88.             94   .849   .000  1.639  intCO[@]
   119.             41  1.883   .000  1.639  label[@]
    99.             84   .885   .000  1.640  caseexp[*,*,@]
    42.            484   .133   .000  1.640  dostmt[*,*,*,@]
   128.             27  2.203   .000  1.640  mod[@]
   247.            120   .495   .000  1.640  caseexp[*,list[*,...@]]
   124.             30  1.826   .000  1.640  relLE[@]
    95.             87   .572   .000  1.641  div[*,@]
   121.             35  1.391   .000  1.641  uminus[@]



pattern     was  freq.      H      H cumul.  pattern
 number refined               contr.      H
   103.             82   .592   .000  1.641  seqindex[@]
   133.             24  1.908   .000  1.641  catchmark[@]
   125.             30  1.505   .000  1.641  relLE[*,@]
   189.             78   .477   .000  1.641  intCO[num[@]]
   170.             30  1.159   .000  1.641  assignx[*,var[@]]
   116.             44   .774   .000  1.642  resume[@]
    87.            102   .323   .000  1.642  in[*,@]
   111.             50   .529   .000  1.642  length[@]
   129.             27   .979   .000  1.642  mod[*,@]
   135.             13  1.884   .000  1.642  memory[@]
   183.            132   .156   .000  1.642  upthru[var[@]]
   127.             29   .431   .000  1.642  enable[*,@]
   239.             50   .242   .000  1.642  seqindex[*,var[local,*,*,@]]
   270.             25   .482   .000  1.642  bumpx[@]
   120.             41   .281   .000  1.642  label[*,@]
   143.              9  1.224   .000  1.642  extract[*,@]
   139.             11   .946   .000  1.642  downthru[*,@]
   145.              8  1.299   .000  1.642  abs[@]
   134.             22   .439   .000  1.642  base[@]
   136.             11   .845   .000  1.642  fdollar[@]
    80.            129   .065   .000  1.642  error[@]
    82.            129   .065   .000  1.642  error[*,*,@]
   146.              8   .811   .000  1.642  start[@]
   148.              8   .544   .000  1.642  start[*,*,@]
   150.              3   .918   .000  1.642  intOO[*,@]
   154.              2  0.000  0.000  1.642  svc[@]
   153.              3  0.000  0.000  1.642  stop[@]
   152.              3  0.000  0.000  1.642  intOC[*,@]
   151.              3  0.000  0.000  1.642  intOC[@]
   149.              3  0.000  0.000  1.642  intOO[@]
   147.              8  0.000  0.000  1.642  start[*,@]
     3.       *  31424  0.000  0.000  1.642  var[*,*,@]
    14.           5697  0.000  0.000  1.642  dot[*,@]
    18.           2754  0.000  0.000  1.642  dollar[*,@]
    25.           1513  0.000  0.000  1.642  @
    26.           1513  0.000  0.000  1.642  body[@]
    31.            639  0.000  0.000  1.642  casestmt[*,@]
    62.            196  0.000  0.000  1.642  fextract[@]
    66.            186  0.000  0.000  1.642  signal[*,*,@]
    70.            183  0.000  0.000  1.642  construct[*,@]
    71.            179  0.000  0.000  1.642  unionx[@]
    72.            179  0.000  0.000  1.642  unionx[*,@]
    85.            105  0.000  0.000  1.642  goto[@]
    92.             91  0.000  0.000  1.642  openstmt[@]
    96.             87  0.000  0.000  1.642  register[@]
    98.             84  0.000  0.000  1.642  caseexp[*,@]
   100.             84  0.000  0.000  1.642  forseq[@]
   106.             81  0.000  0.000  1.642  caseswitch[*,@]
   107.             81  0.000  0.000  1.642  caseswitch[*,*,@]
   108.             80  0.000  0.000  1.642  arraydesc[@]
   109.             77  0.000  0.000  1.642  constructx[@]
   110.             77  0.000  0.000  1.642  constructx[*,@]
   112.             47  0.000  0.000  1.642  row[@]
   113.             47  0.000  0.000  1.642  rowcons[@]
   114.             47  0.000  0.000  1.642  rowcons[*,@]
   115.             44  0.000  0.000  1.642  mwconst[@]
   122.             32  0.000  0.000  1.642  vconstruct[@]
   123.             32  0.000  0.000  1.642  vconstruct[*,@]
   126.             29  0.000  0.000  1.642  enable[@]
   130.             27  0.000  0.000  1.642  min[@]
   131.             27  0.000  0.000  1.642  stringinit[@]
   132.             25  0.000  0.000  1.642  max[@]
   137.             11  0.000  0.000  1.642  fdollar[*,@]
   138.             11  0.000  0.000  1.642  downthru[@]
   140.             10  0.000  0.000  1.642  dst[@]



pattern     was  freq.      H      H cumul.  pattern
 number refined               contr.      H
   141.             10  0.000  0.000  1.642  lstf[@]
   142.              9  0.000  0.000  1.642  extract[@]
   144.              9  0.000  0.000  1.642  lst[@]
   162.           2686  0.000  0.000  1.642  var[entry,*,*,@]
   177.            128  0.000  0.000  1.642  error[var[@]]
   178.            164  0.000  0.000  1.642  signal[var[@]]
   218.             80  0.000  0.000  1.642  arraydesc[list[@]]
   228.             44  0.000  0.000  1.642  signal[*,list[@]]
   234.           1929  0.000  0.000  1.642  dot[var[local,*,*,@]]
   235.             46  0.000  0.000  1.642  error[*,var[local,*,*,@]]
   236.            113  0.000  0.000  1.642  ifstmt[var[local,*,*,@]]
   238.             58  0.000  0.000  1.642  seqindex[var[local,*,*,@]]
   240.            355  0.000  0.000  1.642  uparrow[var[local,*,*,@]]
   250.            572  0.000  0.000  1.642  caseswitch[*,*,list[*,...@]]
   251.            254  0.000  0.000  1.642  casetest[list[*,...@]]
   253.             42  0.000  0.000  1.642  catchphrase[list[*,...@]]
   261.            133  0.000  0.000  1.642  mwconst[list[*,...@]]
   266.             32  0.000  0.000  1.642  vconstruct[*,list[*,...@]]



C.5. Final Pattern Set — Alphabetical by Pattern 

Patterns that have been refined are denoted by an asterisk (*) in the "was refined"
column. See Section C.5 to find the set of patterns refining the given pattern.



pattern     was  freq.      H      H  pattern
 number refined               contr.
    25.           1513  0.000  0.000  @
   145.              8  1.299   .000  abs[@]
    35.            570  1.824   .004  addr[@]
   166.            339   .958   .001  addr[var[@]]
    57.            251  3.104   .003  and[@]
    58.            251  3.113   .003  and[*,@]
   108.             80  0.000  0.000  arraydesc[@]
   218.             80  0.000  0.000  arraydesc[list[@]]
   244.            160  2.536   .001  arraydesc[list[*,...@]]
    11.           5663  1.423   .027  assign[@]
    12.           5663  3.574   .069  assign[*,@]
   186.           1078  3.056   .011  assign[*,num[@]]
   168.            775   .959   .003  assign[*,var[@]]
   230.            590  2.026   .004  assign[*,var[local,*,*,@]]
   167.           3885   .752   .010  assign[var[@]]
   213.            752  4.743   .012  assign[var[global,@]]
   229.           3114  2.078   .022  assign[var[local,*,*,@]]
    49.            323  1.353   .001  assignx[@]
    50.            323  3.313   .004  assignx[*,@]
   170.             30  1.159   .000  assignx[*,var[@]]
   169.            223   .688   .001  assignx[var[@]]
   231.            182  1.963   .001  assignx[var[local,*,*,@]]
   134.             22   .439   .000  base[@]
    26.           1513  0.000  0.000  body[@]
   245.           6617  2.639   .059  body[list[*,...@]]
   269.            163  1.027   .001  bump[@]
   270.             25   .482   .000  bumpx[@]
     8.           6616   .238   .005  call[@]
     9.           6616  2.961   .066  call[*,@]
    10.           6616   .112   .003  call[*,*,@]
   219.           1816  1.085   .007  call[*,list[@]]
   246.           4263  2.425   .035  call[*,list[*,...@]]
   172.           1739   .806   .005  call[*,var[@]]
   198.           1445  2.275   .011  call[*,var[local,@]]
   232.           1445  1.463   .007  call[*,var[local,*,*,@]]
   171.           6370   .992   .021  call[var[@]]
   214.           4016  4.241   .058  call[var[global,@]]
    97.             84  1.343   .000  caseexp[@]
    98.             84  0.000  0.000  caseexp[*,@]
    99.             84   .885   .000  caseexp[*,*,@]
   247.            120   .495   .000  caseexp[*,list[*,...@]]
    30.            639  1.543   .003  casestmt[@]
    31.            639  0.000  0.000  casestmt[*,@]
    32.            639  2.590   .006  casestmt[*,*,@]
   249.            159  2.228   .001  casestmt[*,*,list[*,...@]]
   220.            639  1.208   .003  casestmt[*,list[@]]
   248.            982   .363   .001  casestmt[*,list[*,...@]]
   105.             81  1.175   .000  caseswitch[@]
   106.             81  0.000  0.000  caseswitch[*,@]
   107.             81  0.000  0.000  caseswitch[*,*,@]
   250.            572  0.000  0.000  caseswitch[*,*,list[*,...@]]
    33.            572   .597   .001  casetest[@]
    34.            572  2.612   .005  casetest[*,@]
   252.            600  1.863   .004  casetest[*,list[*,...@]]
   251.            254  0.000  0.000  casetest[list[*,...@]]
   133.             24  1.908   .000  catchmark[@]
    78.            130  1.040   .000  catchphrase[@]
    79.            130  1.024   .000  catchphrase[*,@]
   253.             42  0.000  0.000  catchphrase[list[*,...@]]



pattern     was  freq.      H      H  pattern
 number refined               contr.
    69.            183  1.817   .001  construct[@]
    70.            183  0.000  0.000  construct[*,@]
   109.             77  0.000  0.000  constructx[@]
   110.             77  0.000  0.000  constructx[*,@]
    46.            433   .195   .000  dindex[@]
    47.            433  2.450   .004  dindex[*,@]
    94.             87  2.183   .001  div[@]
    95.             87   .572   .000  div[*,@]
    17.           2754  2.050   .019  dollar[@]
    18.           2754  0.000  0.000  dollar[*,@]
   176.           2754   .126   .001  dollar[*,var[@]]
   164.           2754  1.713   .016  dollar[*,var[*,*,@]]
   242.           2709  2.433   .022  dollar[*,var[field,@]]
   268.           2709  3.287   .030  dollar[*,var[field,*,*,@]]
   175.            964   .682   .002  dollar[var[@]]
   233.            840  2.252   .006  dollar[var[local,*,*,@]]
    39.            484  1.610   .003  dostmt[@]
    40.            484  1.777   .003  dostmt[*,@]
    41.            484  2.269   .004  dostmt[*,*,@]
    42.            484   .133   .000  dostmt[*,*,*,@]
    43.            484   .292   .000  dostmt[*,*,*,*,@]
   221.            253  2.208   .002  dostmt[*,*,list[@]]
   255.            828  2.427   .007  dostmt[*,*,list[*,...@]]
    13.           5697  1.401   .027  dot[@]
    14.           5697  0.000  0.000  dot[*,@]
   174.           5697   .855   .017  dot[*,var[@]]
   163.           5697   .657   .013  dot[*,var[*,*,@]]
   243.           4570  3.441   .053  dot[*,var[field,@]]
   267.           4570  2.548   .039  dot[*,var[field,*,*,@]]
   216.            944  4.912   .016  dot[*,var[global,@]]
   173.           2941   .939   .009  dot[var[@]]
   215.           1009  2.251   .008  dot[var[global,@]]
   199.           1929  1.384   .009  dot[var[local,@]]
   234.           1929  0.000  0.000  dot[var[local,*,*,@]]
   138.             11  0.000  0.000  downthru[@]
   139.             11   .946   .000  downthru[*,@]
   140.             10  0.000  0.000  dst[@]
   126.             29  0.000  0.000  enable[@]
   127.             29   .431   .000  enable[*,@]
    80.            129   .065   .000  error[@]
    81.            129  1.903   .001  error[*,@]
    82.            129   .065   .000  error[*,*,@]
   235.             45  0.000  0.000  error[*,var[local,*,*,@]]
   177.            128  0.000  0.000  error[var[@]]
   142.              9  0.000  0.000  extract[@]
   143.              9  1.224   .000  extract[*,@]
   136.             11   .845   .000  fdollar[@]
   137.             11  0.000  0.000  fdollar[*,@]
    62.            196  0.000  0.000  fextract[@]
    63.            196   .672   .000  fextract[*,@]
   222.            196  1.099   .001  fextract[list[@]]
   256.            354   .663   .001  fextract[list[*,...@]]
   100.             84  0.000  0.000  forseq[@]
   101.             84  2.496   .001  forseq[*,@]
   102.             84  1.745   .000  forseq[*,*,@]
   200.             84   .963   .000  forseq[var[local,@]]
    85.            105  0.000  0.000  goto[@]
    59.            211  2.315   .002  ifexp[@]
    60.            211  2.433   .002  ifexp[*,@]
    61.            211  3.005   .002  ifexp[*,*,@]
    21.           1810  3.047   .019  ifstmt[@]
    22.           1810  3.093   .019  ifstmt[*,@]
    23.           1810  1.333   .008  ifstmt[*,*,@]
   224.            173  2.051   .001  ifstmt[*,*,list[@]]
   223.            598  2.132   .004  ifstmt[*,list[@]]



pattern     was  freq.      H      H  pattern
 number refined               contr.
   236.            113  0.000  0.000  ifstmt[var[local,*,*,@]]
    86.            102  1.492   .001  in[@]
    87.            102   .323   .000  in[*,@]
    51.            338   .965   .001  index[@]
    52.            338  2.325   .003  index[*,@]
   187.             85  1.606   .000  index[*,num[@]]
    55.            262  3.019   .003  inlinecall[@]
    56.            262   .843   .001  inlinecall[*,@]
   225.            230   .549   .000  inlinecall[*,list[@]]
   257.            489  3.016   .005  inlinecall[*,list[*,...@]]
    67.            183   .483   .000  intCC[@]
    68.            183  1.466   .001  intCC[*,@]
   188.            168  2.569   .001  intCC[num[@]]
    88.             94   .849   .000  intCO[@]
    89.             94  2.573   .001  intCO[*,@]
   189.             78   .477   .000  intCO[num[@]]
   151.              3  0.000  0.000  intOC[@]
   152.              3  0.000  0.000  intOC[*,@]
   149.              3  0.000  0.000  intOO[@]
   150.              3   .918   .000  intOO[*,@]
    28.           1214   .936   .004  item[@]
    29.           1214  3.315   .014  item[*,@]
   227.            425  2.096   .003  item[*,list[@]]
   259.           1355  2.625   .012  item[*,list[*,...@]]
   226.             97  1.190   .000  item[list[@]]
   258.            243   .422   .000  item[list[*,...@]]
   119.             41  1.883   .000  label[@]
   120.             41   .281   .000  label[*,@]
   260.            187  2.311   .001  label[list[*,...@]]
    77.            159   .984   .001  lbl[@]
   111.             50   .529   .000  length[@]
     7.       *   2874  3.163   .031  list[@]
     5.       *   3096  3.130   .033  list[*,...@]
   201.           1909  2.798   .018  list[*,...var[local,@]]
   237.           1909  2.104   .014  list[*,...var[local,*,*,@]]
   144.              9  0.000  0.000  lst[@]
   141.             10  0.000  0.000  lstf[@]
   132.             25  0.000  0.000  max[@]
   135.             13  1.884   .000  memory[@]
   130.             27  0.000  0.000  min[@]
    44.            469  2.672   .004  minus[@]
    45.            469  1.590   .003  minus[*,@]
   190.            318  2.446   .003  minus[*,num[@]]
   203.             90  2.595   .001  minus[*,var[local,@]]
   202.            199  2.803   .002  minus[var[local,@]]
   128.             27  2.203   .000  mod[@]
   129.             27   .979   .000  mod[*,@]
   115.             44  0.000  0.000  mwconst[@]
   261.            133  0.000  0.000  mwconst[list[*,...@]]
    48.            382  2.636   .003  not[@]
     6.       *   4850  5.828   .096  num[@]
    92.             91  0.000  0.000  openstmt[@]
    93.             91  1.052   .000  openstmt[*,@]
   262.            391  2.113   .003  openstmt[*,list[*,...@]]
    83.            127  2.955   .001  or[@]
    84.            127  2.856   .001  or[*,@]
    15.           3847  1.032   .013  plus[@]
    16.           3847  1.180   .015  plus[*,@]
   191.            316  2.807   .003  plus[*,num[@]]
   205.           3027  1.795   .018  plus[*,var[local,@]]
   204.            321  2.684   .003  plus[var[local,@]]
    96.             87  0.000  0.000  register[@]
   193.             87  2.221   .001  register[num[@]]
    19.           2348  2.023   .016  relE[@]
    20.           2348   .819   .007  relE[*,@]



pattern     was  freq.      H      H  pattern
 number refined               contr.
   117.             41  1.954   .000  relGE[@]
   118.             41  2.022   .000  relGE[*,@]
    75.            159  2.138   .001  relG[@]
    76.            159  1.386   .001  relG[*,@]
   124.             30  1.826   .000  relLE[@]
   125.             30  1.505   .000  relLE[*,@]
    90.             93  2.197   .001  relL[@]
    91.             93  2.141   .001  relL[*,@]
    36.            538  2.411   .004  relN[@]
    37.            538  1.174   .002  relN[*,@]
   116.             44   .774   .000  resume[@]
    24.           1683  2.521   .014  return[@]
   263.            160  1.636   .001  return[list[*,...@]]
   195.            148  2.227   .001  return[num[@]]
   181.            337   .692   .001  return[var[@]]
   112.             47  0.000  0.000  row[@]
   264.            686   .882   .002  row[list[*,...@]]
   113.             47  0.000  0.000  rowcons[@]
   114.             47  0.000  0.000  rowcons[*,@]
   103.             82   .592   .000  seqindex[@]
   104.             82  1.736   .000  seqindex[*,@]
   239.             50   .242   .000  seqindex[*,var[local,*,*,@]]
   238.             58  0.000  0.000  seqindex[var[local,*,*,@]]
    64.            186   .524   .000  signal[@]
    65.            186  1.602   .001  signal[*,@]
    66.            186  0.000  0.000  signal[*,*,@]
   228.             44  0.000  0.000  signal[*,list[@]]
   265.             88  1.680   .001  signal[*,list[*,...@]]
   178.            164  0.000  0.000  signal[var[@]]
   217.            164  3.938   .002  signal[var[global,@]]
   146.              8   .811   .000  start[@]
   147.              8  0.000  0.000  start[*,@]
   148.              8   .544   .000  start[*,*,@]
   153.              3  0.000  0.000  stop[@]
    38.            511  8.735   .015  str[@]
   131.             27  0.000  0.000  stringinit[@]
   154.              2  0.000  0.000  svc[@]
    53.            317  1.702   .002  times[@]
    54.            317   .535   .001  times[*,@]
   192.            292  2.387   .002  times[*,num[@]]
   206.            162  2.598   .001  times[var[local,@]]
   121.             35  1.391   .000  uminus[@]
    71.            179  0.000  0.000  unionx[@]
    72.            179  0.000  0.000  unionx[*,@]
   165.            179  2.818   .002  unionx[var[*,*,@]]
    27.           1260  1.184   .005  uparrow[@]
   182.            372   .268   .000  uparrow[var[@]]
   211.            355  1.130   .001  uparrow[var[local,@]]
   240.            355  0.000  0.000  uparrow[var[local,*,*,@]]
    73.            170   .767   .000  upthru[@]
    74.            170  1.220   .001  upthru[*,@]
   183.            132   .156   .000  upthru[var[@]]
   212.            129  2.023   .001  upthru[var[local,@]]
   241.            129  1.400   .001  upthru[var[local,*,*,@]]
     1.       *  10800  1.141   .042  var[@]
     2.       *         0.000  0.000  var[*,@]
     3.       *  31424  0.000  0.000  var[*,*,@]
     4.       *         0.000  0.000  var[*,*,*,@]
   158.           2686  4.039   .037  var[entry,@]
   162.           2686  0.000  0.000  var[entry,*,*,@]
   157.       *    428  1.491   .002  var[field,@]
   161.       *    428  1.749   .003  var[field,*,*,@]
   155.       *   5332  3.411   .062  var[global,@]
   159.          12217  1.114   .046  var[global,*,*,@]
   156.       *   2335  3.153   .025  var[local,@]



pattern 


was 


freq. 


H 


H 


number 


refined 






contr. 


160. 


* 


6685 


2.327 


.053 


122. 




32 


0.000 


0.000 


123. 




32 


0.000 


0.000 


266. 




32 


0.000 


0.000 


197. 




610 


2.949 


.006 


196. 




3296 


2.995 


.033 


254. 




670 


2.948 


.007 


185. 




289 


.615 


.001 


208. 




245 


2.364 


.002 


184. 




757 


1.317 


.003 


207. 




264 


3.453 


.003 


180. 




403 


.998 


.001 


179. 




955 


.549 


.002 


194. 




2610 


4.744 


.042 


210. 




191 


3.006 


.002 


209. 




853 


2.607 


.008 



pattern 



var[local ,*,*,@] 

vconstruct[0] 

vconstruct[*,@] 

vconstruct[*,list[*, . . .@]] 

{assign | ass ignx)[* ,var[ local ,@]] 

{assign|assignx}[var[local ,0]] 

{construct I con St ructx}[*,l ist[*, . . .6]] 

{index|dindex|seqindex}[«, var[@i] 

{index |dindex|seqindex}[*,var[local ,@]] 

{indexjdindexl seqindex}[var[6]] 

{index jdindexj seqindex}[var[ local ,@]] 

{relE|relG|relN|re1L|relGE|relLE}i*,var[0]] 

{relE|relG|relN|relL|relGE|relLE}[var[@]] 

{relL|relLE|relE|relN|relGE|relG)[*.num[0]] 

{relL|relLE|relE|relN|relGE|relG}[«,var[local,@]] 

{relL|relLE|relE|relN|relGE|relG}[var[local,e]] 
Appendix C: Detailed Pattern Data



C.6. Final Pattern Set — Sorted by Pattern Number 



(1) var[@]
f = 10800 H = 1.141, p = .037, H contr. = .042

SEE ALSO:      f       H      p   contr
    (166)    339    .958   .001    .001, addr[var[@]]
    (167)   3885    .752   .013    .010, assign[var[@]]
    (168)    775    .959   .003    .003, assign[*,var[@]]
    (169)    223    .688   .001    .001, assignx[var[@]]
    (170)     30   1.159   .000    .000, assignx[*,var[@]]
    (171)   6370    .992   .022    .021, call[var[@]]
    (172)   1739    .806   .006    .005, call[*,var[@]]
    (173)   2941    .939   .010    .009, dot[var[@]]
    (174)   5697    .855   .019    .017, dot[*,var[@]]
    (175)    964    .682   .003    .002, dollar[var[@]]
    (176)   2754    .126   .009    .001, dollar[*,var[@]]
    (177)    128   0.000   .000   0.000, error[var[@]]
    (178)    164   0.000   .001   0.000, signal[var[@]]
    (179)    955    .549   .003    .002, {relE|relG|relN|relL|relGE|relLE}[var[@]]
    (180)    403    .998   .001    .001, {relE|relG|relN|relL|relGE|relLE}[*,var[@]]
    (181)    337    .692   .001    .001, return[var[@]]
    (182)    372    .268   .001    .000, uparrow[var[@]]
    (183)    132    .156   .000    .000, upthru[var[@]]
    (184)    757   1.317   .003    .003, {index|dindex|seqindex}[var[@]]
    (185)    289    .615   .001    .001, {index|dindex|seqindex}[*,var[@]]

4 cases
local        6816  .63    field         209  .02
global       3632  .34    entry         143  .01


(2) var[*,@]
f =        H = 0.000, p = 1.000, H contr. = 0.000

SEE ALSO:      f       H      p   contr
    (155)   5332   3.411   .018    .062, var[global,@]
    (156)   2335   3.153   .008    .025, var[local,@]
    (157)    428   1.491   .001    .002, var[field,@]
    (158)   2686   4.039   .009    .037, var[entry,@]


(3) var[*,*,@]
f = 31424 H = 0.000, p = .106, H contr. = 0.000

SEE ALSO:      f       H      p   contr
    (163)   5697    .657   .019    .013, dot[*,var[*,*,@]]
    (164)   2754   1.713   .009    .016, dollar[*,var[*,*,@]]
    (165)    179   2.818   .001    .002, unionx[var[*,*,@]]

            31424 1.00






(4) var[*,*,*,@]
f =        H = 0.000, p = 1.000, H contr. = 0.000

SEE ALSO:      f       H      p   contr
    (159)  12217   1.114   .041    .046, var[global,*,*,@]
    (160)   6685   2.327   .023    .053, var[local,*,*,@]
    (161)    428   1.749   .001    .003, var[field,*,*,@]
    (162)   2686   0.000   .009   0.000, var[entry,*,*,@]


(5) list[*,...@]
f = 3096 H = 3.130, p = .010, H contr. = .033

SEE ALSO:      f       H      p   contr
    (244)    160   2.536   .001    .001, arraydesc[list[*,...@]]
    (245)   6617   2.639   .022    .059, body[list[*,...@]]
    (246)   4263   2.425   .014    .035, call[*,list[*,...@]]
    (247)    120    .495   .000    .000, caseexp[*,list[*,...@]]
    (248)    982    .363   .003    .001, casestmt[*,list[*,...@]]
    (249)    159   2.228   .001    .001, casestmt[*,*,list[*,...@]]
    (250)    572   0.000   .002   0.000, caseswitch[*,*,list[*,...@]]
    (251)    254   0.000   .001   0.000, casetest[list[*,...@]]
    (252)    600   1.863   .002    .004, casetest[*,list[*,...@]]
    (253)     42   0.000   .000   0.000, catchphrase[list[*,...@]]
    (254)    670   2.948   .002    .007, {construct|constructx}[*,list[*,...@]]
    (255)    828   2.427   .003    .007, dostmt[*,*,list[*,...@]]
    (256)    354    .663   .001    .001, fextract[list[*,...@]]
    (257)    489   3.016   .002    .005, inlinecall[*,list[*,...@]]
    (258)    243    .422   .001    .000, item[list[*,...@]]
    (259)   1355   2.625   .005    .012, item[*,list[*,...@]]
    (260)    187   2.311   .001    .001, label[list[*,...@]]
    (261)    133   0.000   .000   0.000, mwconst[list[*,...@]]
    (262)    391   2.113   .001    .003, openstmt[*,list[*,...@]]
    (263)    160   1.636   .001    .001, return[list[*,...@]]
    (264)    686    .882   .002    .002, row[list[*,...@]]
    (265)     88   1.680   .000    .001, signal[*,list[*,...@]]
    (266)     32   0.000   .000   0.000, vconstruct[*,list[*,...@]]

44 cases (17 shown)
assign        966  .31    num            82  .03    signal         25  .01
call          901  .29    dostmt         73  .02    dot            23  .01
ifstmt        265  .09    casestmt       63  .02    unionx         23  .01
var           168  .05    fextract       57  .02    dollar         22  .01
return        140  .05    construct      43  .01    exit           18  .01
<empty>        85  .03    bump           40  .01    <<others>>    102  .03


(6) num[@]
f = 4850 H = 5.828, p = .016, H contr. = .096

SEE ALSO:      f       H      p   contr
    (186)   1078   3.056   .004    .011, assign[*,num[@]]
    (187)     85   1.606   .000    .000, index[*,num[@]]
    (188)    168   2.569   .001    .001, intCC[num[@]]
    (189)     78    .477   .000    .000, intCO[num[@]]
    (190)    318   2.446   .001    .003, minus[*,num[@]]
    (191)    316   2.807   .001    .003, plus[*,num[@]]
    (192)    292   2.387   .001    .002, times[*,num[@]]
    (193)     87   2.221   .000    .001, register[num[@]]
    (194)   2610   4.744   .009    .042, {relL|relLE|relE|relN|relGE|relG}[*,num[@]]
    (195)    148   2.227   .001    .001, return[num[@]]

388 cases (25 shown)
[illegible]   930  .19    65535          91  .02    11             33  .01
1             606  .12    16383          85  .02    256            30  .01
2             318  .07    6              80  .02    40             29  .01
3             195  .04    7              75  .02    17             27  .01
4             152  .03    8              60  .01    12             26  .01
13            114  .02    10             58  .01    14             26  .01
16            104  .02    32             56  .01    63             26  .01
5             103  .02    9              44  .01    <<others>>   1450  .30
-1             93  .02    255            39  .01








(7) list[@]
f = 2874 H = 3.163, p = .010, H contr. = .031

SEE ALSO:      f       H      p   contr
    (218)     80   0.000   .000   0.000, arraydesc[list[@]]
    (219)   1816   1.085   .006    .007, call[*,list[@]]
    (220)    639   1.208   .002    .003, casestmt[*,list[@]]
    (221)    253   2.208   .001    .002, dostmt[*,*,list[@]]
    (222)    196   1.099   .001    .001, fextract[list[@]]
    (223)    598   2.132   .002    .004, ifstmt[*,list[@]]
    (224)    173   2.051   .001    .001, ifstmt[*,*,list[@]]
    (225)    230    .549   .001    .000, inlinecall[*,list[@]]
    (226)     97   1.190   .000    .000, item[list[@]]
    (227)    425   2.096   .001    .003, item[*,list[@]]
    (228)     44   0.000   .000   0.000, signal[*,list[@]]

36 cases (15 shown)
2             839  .29    7              79  .03    12             20  .01
1             511  .18    8              65  .02    13             15  .01
3             482  .17    [illegible]    60  .02    16             15  .01
4             279  .10    10             53  .02    <<others>>     61  .02
5             184  .06    9              44  .02
6             137  .05    11             30  .01









(8) call[@]
f = 6616 H = .238, p = .022, H contr. = .005

var          6370  .96    dot           236  .04    dollar         10  .00
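
The H value printed for an entry can be checked against its case counts. The following is a minimal sketch, assuming that H is the Shannon entropy, in bits, of the empirical distribution of the cases listed under the pattern. The helper name entropy_bits is illustrative only and is not taken from the analysis programs used for this appendix.

```python
import math


def entropy_bits(counts):
    """Shannon entropy, in bits, of the empirical distribution given by counts."""
    total = sum(counts)
    # Zero counts contribute nothing to the sum and are skipped.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Case counts printed for pattern (8) call[@].
call_cases = {"var": 6370, "dot": 236, "dollar": 10}

h = entropy_bits(call_cases.values())
print(f"f = {sum(call_cases.values())}  H = {h:.3f}")
```

For pattern (8) the result agrees with the tabulated H to three decimal places. Entries that show only some of their cases fold the remainder into <<others>>, so the same calculation on the printed counts can only give a lower bound on H for those entries.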



(9) call[*,@]
f = 6616 H = 2.961, p = .022, H contr. = .066

27 cases (12 shown)
list         1816  .27    dollar        209  .03    dindex         56  .01
var          1739  .26    str           204  .03    ifexp          47  .01
<empty>       760  .11    call          189  .03    <<others>>    142  .02
num           686  .10    addr          129  .02
dot           571  .09    plus           68  .01
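
The "H contr." column appears to be the product of p, the pattern's share of all nodes, and H. The sketch below checks this reading against entries (8), (9) and (10); it is an assumption inferred from the printed figures, not a definition quoted from the text, and the function name h_contribution is hypothetical. Because the printed p is rounded to three places, the product can only be expected to match the printed contribution to within about .001.

```python
def h_contribution(p, h):
    """Entropy contribution of one pattern: node share p times entropy H."""
    return p * h


# (pattern, p, H, printed H contr.) copied from entries (8), (9) and (10).
entries = [
    ("call[@]", 0.022, 0.238, 0.005),
    ("call[*,@]", 0.022, 2.961, 0.066),
    ("call[*,*,@]", 0.022, 0.112, 0.003),
]

differences = {}
for pattern, p, h, printed in entries:
    estimate = h_contribution(p, h)
    differences[pattern] = abs(estimate - printed)
    print(f"{pattern:12s} p*H = {estimate:.4f}   printed = {printed:.3f}")
```

Under this reading, summing H contr. over every pattern in the final set would give the overall entropy estimate in bits per node.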








(10) call[*,*,@]
f = 6616 H = .112, p = .022, H contr. = .003

<empty>      6517  .99    catchphrase    99  .01








(11) assign[@]
f = 5663 H = 1.423, p = .019, H contr. = .027

9 cases
var          3885  .69    uparrow       100  .02    seqindex       17  .00
dot          1065  .19    index          79  .01    register        9  .00
dollar        440  .08    dindex         63  .01    memory          5  .00


(12) assign[*,@]
f = 5663 H = 3.574, p = .019, H contr. = .069

43 cases (18 shown)
call         1288  .23    addr          208  .04    uparrow        36  .01
num          1078  .19    assignx       151  .03    index          36  .01
var           775  .14    minus         134  .02    mwconst        34  .01
dot           580  .10    dindex        113  .02    times          31  .01
<empty>       314  .06    inlinecall     72  .01    <<others>>    215  .04
dollar        261  .05    arraydesc      63  .01
plus          214  .04    ifexp          60  .01









(13) dot[@]
f = 5697 H = 1.401, p = .019, H contr. = .027

9 cases
var          2941  .52    dollar        109  .02    minus           7  .00
plus         2349  .41    register       16  .00    call            7  .00
dot           253  .04    num            11  .00    assignx         4  .00


(14) dot[*,@]
f = 5697 H = 0.000, p = .019, H contr. = .000

var          5697 1.00














(15) plus[@]
f = 3847 H = 1.032, p = .013, H contr. = .013

22 cases (11 shown)
var          3243  .84    times          39  .01    plus           16  .00
dot           331  .09    minus          25  .01    call           15  .00
dollar         49  .01    register       22  .01    caseexp         5  .00
num            49  .01    index          21  .01    <<others>>     32  .01


(16) plus[*,@]
f = 3847 H = 1.180, p = .013, H contr. = .015

[illegible] cases (11 shown)
var          3083  .80    call           51  .01    caseexp         6  .00
num           316  .08    times          18  .00    ifexp           5  .00
dollar        181  .05    div            13  .00    minus           5  .00
dot           148  .04    inlinecall     10  .00    <<others>>     11  .00


(17) dollar[@]
f = 2754 H = 2.050, p = .009, H contr. = .019

8 cases
var           964  .35    dollar        106  .04    call           18  .01
uparrow       944  .34    index          77  .03    assignx         1  .00
dot           567  .21    dindex         77  .03









(18) dollar[*,@]
f = 2754 H = 0.000, p = .009, H contr. = 0.000

var          2754 1.00



(19) relE[@]
f = 2348 H = 2.023, p = .008, H contr. = .016

15 cases
<empty>      1249  .53    call           50  .02    index           7  .00
var           558  .24    inlinecall     15  .01    bumpx           6  .00
dot           230  .10    dindex         12  .01    plus            5  .00
dollar        136  .06    seqindex       11  .00    minus           2  .00
assignx        58  .02    mod             8  .00    uparrow         1  .00


(20) relE[*,@]
f = 2348 H = .819, p = .008, H contr. = .007

15 cases
num          1976  .84    call            4  .00    plus            1  .00
var           298  .13    dindex          3  .00    addr            1  .00
dot            41  .02    mwconst         3  .00    seqindex        1  .00
dollar         10  .00    relN            2  .00    length          1  .00
relE            5  .00    or              1  .00    register        1  .00


(21) ifstmt[@]
f = 1810 H = 3.047, p = .006, H contr. = .019

16 cases
relE          593  .33    or             86  .05    relLE          20  .01
relN          370  .20    dot            57  .03    in             18  .01
not           157  .09    call           48  .03    assignx         3  .00
var           135  .07    relL           36  .02    dindex          3  .00
and           132  .07    relGE          29  .02
relG          101  .06    dollar         22  .01








(22) ifstmt[*,@]
f = 1810 H = 3.093, p = .006, H contr. = .019

25 cases (16 shown)
list          598  .33    signal         70  .04    resume         12  .01
call          324  .18    ifstmt         40  .02    construct      11  .01
assign        239  .13    syserror       37  .02    openstmt       10  .01
return        196  .11    goto           33  .02    bump           10  .01
error          82  .05    dostmt         23  .01    <<others>>     32  .02
exit           75  .04    casestmt       18  .01








(23) ifstmt[*,*,@]
f = 1810 H = 1.333, p = .006, H contr. = .008

19 cases (11 shown)
<empty>      1406  .78    ifstmt         45  .02    return          6  .00
list          173  .10    casestmt       13  .01    goto            4  .00
call           69  .04    openstmt       10  .01    fextract        2  .00
assign         64  .04    dostmt          7  .00    <<others>>     11  .01


(24) return[@]
f = 1683 H = 2.521, p = .006, H contr. = .014

30 cases (12 shown)
<empty>       839  .50    caseexp        32  .02    relE           15  .01
var           337  .20    constructx     31  .02    plus           15  .01
num           148  .09    dollar         26  .02    <<others>>     47  .03
list           74  .04    ifexp          24  .01
call           73  .04    dot            22  .01









(25) @
f = 1513 H = 0.000, p = .005, H contr. = 0.000

body         1513 1.00



(26) body[@]
f = 1513 H = 0.000, p = .005, H contr. = 0.000

list         1513 1.00



(27) uparrow[@]
f = 1260 H = 1.184, p = .004, H contr. = .005

7 cases
plus          822  .65    dollar          4  .00    register        1  .00
var           372  .30    minus           2  .00
num            58  .05    dot             1  .00

(28) item[@]
f = 1214 H = .936, p = .004, H contr. = .004

9 cases
relE         1021  .84    in             25  .02    relN            2  .00
list           97  .08    relL            7  .01    relLE           2  .00
lbl            52  .04    relG            7  .01    relGE           1  .00


(29) item[*,@]
f = 1214 H = 3.315, p = .004, H contr. = .014

36 cases (19 shown)
list          425  .35    num            26  .02    resume         11  .01
assign        187  .15    dollar         20  .02    and            10  .01
call          170  .14    goto           20  .02    exit           10  .01
ifstmt        102  .08    openstmt       17  .01    dot             7  .01
casestmt       42  .03    caseexp        14  .01    continue        7  .01
nullstmt       41  .03    signal         12  .01    <<others>>     45  .04
return         37  .03    error          11  .01








(30) casestmt[@]
f = 639 H = 1.543, p = .002, H contr. = .003

11 cases
dollar        450  .70    num            17  .03    uparrow         2  .00
var            67  .10    assignx         4  .01    inlinecall      1  .00
dot            55  .09    seqindex        3  .00    index           1  .00
call           37  .06    minus           2  .00








(31) casestmt[*,@]
f = 639 H = 0.000, p = .002, H contr. = .000

list          639 1.00














(32) casestmt[*,*,@]
f = 639 H = 2.590, p = .002, H contr. = .006

15 cases
<empty>       269  .42    return         22  .03    nullstmt        7  .01
syserror      151  .24    assign         21  .03    goto            4  .01
list           61  .10    ifstmt         13  .02    casestmt        2  .00
signal         35  .05    exit           10  .02    openstmt        1  .00
call           32  .05    error          10  .02    bump            1  .00


(33) casetest[@]
f = 572 H = .597, p = .002, H contr. = .001

num           489  .85    list           83  .15








(34) casetest[*,@]
f = 572 H = 2.612, p = .002, H contr. = .005

15 cases
list          194  .34    nullstmt       16  .03    openstmt        4  .01
assign        140  .24    str            15  .03    exit            3  .01
call          112  .20    return         11  .02    goto            2  .00
num            41  .07    ifexp           6  .01    signal          2  .00
ifstmt         20  .03    casestmt        5  .01    label           1  .00


(35) addr[@]
f = 570 H = 1.824, p = .002, H contr. = .004

6 cases
var           339  .59    uparrow        68  .12    index          32  .06
dot            84  .15    dollar         32  .06    dindex         15  .03


(36) relN[@]
f = 538 H = 2.411, p = .002, H contr. = .004

16 cases
var           209  .39    dindex          6  .01    mod             2  .00
dot           157  .29    seqindex        6  .01    fdollar         1  .00
dollar         69  .13    plus            4  .01    uparrow         1  .00
call           37  .07    index           3  .01    length          1  .00
inlinecall     24  .04    <empty>         2  .00
assignx        14  .03    minus           2  .00








(37) relN[*,@]
f = 538 H = 1.174, p = .002, H contr. = .002

10 cases
num           430  .80    call            7  .01    inlinecall      2  .00
var            53  .10    addr            5  .01    mwconst         1  .00
dot            24  .04    seqindex        5  .01
dollar          8  .01    uparrow         3  .01








(38) str[@]
f = 511 H = 8.735, p = .002, H contr. = .015

449 cases (11 shown)
"Break"         4  .01    "VM."           3  .01    "error"         3  .01
"Trace"         4  .01    " XXX"          3  .01    "Error # "      3  .01
" . xm"         3  .01    "NIL"           3  .01    "New"           2  .00
[illegible]     3  .01    ".XM"           3  .01    <<others>>    477  .93


(39) dostmt[@]
f = 484 H = 1.610, p = .002, H contr. = .003

4 cases
<empty>       219  .45    forseq         84  .17
upthru        170  .35    downthru       11  .02








(40) dostmt[*,@]
f = 484 H = 1.777, p = .002, H contr. = .003

12 cases
<empty>       277  .57    relL           10  .02    in              3  .01
not           136  .28    and             5  .01    var             3  .01
relN           28  .06    relG            5  .01    relLE           2  .00
relE           11  .02    relGE           3  .01    or              1  .00


(41) dostmt[*,*,@]
f = 484 H = 2.269, p = .002, H contr. = .004

14 cases
list          253  .52    openstmt        8  .02    enable          3  .01
assign         62  .13    label           7  .01    inlinecall      1  .00
ifstmt         54  .11    bump            4  .01    dostmt          1  .00
casestmt       44  .09    signal          3  .01    catchmark       1  .00
call           40  .08    nullstmt        3  .01








(42) dostmt[*,*,*,@]
f = 484 H = .133, p = .002, H contr. = .000

<empty>       475  .98    item            9  .02








(43) dostmt[*,*,*,*,@]
f = 484 H = .292, p = .002, H contr. = .000

8 cases
<empty>       468  .97    ifstmt          2  .00    goto            1  .00
list            6  .01    return          2  .00    exit            1  .00
assign          3  .01    call            1  .00








(44) minus[@]
f = 469 H = 2.672, p = .002, H contr. = .004

18 cases
var           224  .48    call           11  .02    uminus          3  .01
dot            62  .13    div            10  .02    length          3  .01
<empty>        45  .10    minus           8  .02    times           2  .00
plus           38  .08    index           7  .01    uparrow         2  .00
dollar         25  .05    dindex          4  .01    assignx         1  .00
num            20  .04    ifexp           3  .01    abs             1  .00


(45) minus[*,@]
f = 469 H = 1.590, p = .002, H contr. = .003

14 cases
num           318  .68    minus           4  .01    ifexp           1  .00
var            91  .19    times           4  .01    assignx         1  .00
dot            17  .04    index           4  .01    div             1  .00
dollar         15  .03    addr            2  .00    inlinecall      1  .00
plus            8  .02    call            2  .00








(46) dindex[@]
f = 433 H = .195, p = .001, H contr. = .000

var           420  .97    dot            13  .03








(47) dindex[*,@]
f = 433 H = 2.450, p = .001, H contr. = .004

11 cases
var           131  .30    minus          21  .05    call            3  .01
num           102  .24    dot             8  .02    uparrow         2  .00
times          89  .21    assignx         4  .01    ifexp           1  .00
plus           68  .16    dollar          4  .01









(48) not[@]
f = 382 H = 2.636, p = .001, H contr. = .003

13 cases
relE          118  .31    in             19  .05    and             1  .00
var            97  .25    relL            5  .01    relN            1  .00
call           50  .13    relG            5  .01    relGE           1  .00
dot            47  .12    or              4  .01
dollar         32  .08    ifexp           2  .01








(49) assignx[@]
f = 323 H = 1.353, p = .001, H contr. = .001

7 cases
var           223  .69    uparrow         7  .02    dindex          1  .00
dot            68  .21    seqindex        4  .01
dollar         17  .05    index           3  .01








(50) assignx[*,@]
f = 323 H = 3.313, p = .001, H contr. = .004

22 cases (14 shown)
call           73  .23    minus          19  .06    register        6  .02
num            71  .22    dollar         19  .06    inlinecall      3  .01
dot            40  .12    dindex         12  .04    seqindex        3  .01
var            30  .09    plus            9  .03    relE            2  .01
assignx        22  .07    addr            6  .02    <<others>>      8  .02


(51) index[@]
f = 338 H = .965, p = .001, H contr. = .001

var           265  .78    dollar         42  .12    dot            31  .09


(52) index[*,@]
f = 338 H = 2.325, p = .001, H contr. = .003

11 cases
var           106  .31    dollar         13  .04    inlinecall      2  .01
times          85  .25    mod             4  .01    ifexp           1  .00
num            85  .25    plus            3  .01    div             1  .00
minus          36  .11    dot             2  .01








(53) times[@]
f = 317 H = 1.702, p = .001, H contr. = .002

13 cases
var           227  .72    plus            9  .03    length          2  .01
minus          22  .07    inlinecall      4  .01    ifexp           1  .00
call           18  .06    times           3  .01    abs             1  .00
dot            16  .05    div             2  .01
dollar         10  .03    uminus          2  .01








(54) times[*,@]
f = 317 H = .535, p = .001, H contr. = .001

5 cases
num           292  .92    dot             8  .03    dollar          1  .00
var            10  .03    call            6  .02








(55) inlinecall[@]
f = 262 H = 3.019, p = .001, H contr. = .003

18 cases
BITAND         86  .33    Stop            9  .03    PORTI           2  .01
BITSHIFT       42  .16    BITXOR          7  .03    CONVERT         1  .00
BITOR          39  .15    LDIVMOD         4  .02    PUSH            1  .00
DIVMOD         26  .10    BLOCK           4  .02    use             1  .00
COPY           17  .06    NovaOutLd       3  .01    LongDiv         1  .00
BITNOT         16  .06    NovaInLd        2  .01    LongMult        1  .00


(56) inlinecall[*,@]
f = 262 H = .843, p = .001, H contr. = .001

8 cases
list          230  .88    var             7  .03    uparrow         1  .00
num             8  .03    dot             4  .02    index           1  .00
<empty>         7  .03    inlinecall      4  .02








(57) and[@]
f = 251 H = 3.104, p = .001, H contr. = .003

14 cases
relE           60  .24    call           12  .05    or              5  .02
and            51  .20    dot             8  .03    relGE           3  .01
relN           37  .15    relL            7  .03    in              3  .01
var            32  .13    relG            6  .02    relLE           1  .00
not            20  .08    dollar          6  .02








(58) and[*,@]
f = 251 H = 3.113, p = .001, H contr. = .003

17 cases
relE           85  .34    or             10  .04    caseexp         2  .01
relN           42  .17    dot            10  .04    index           2  .01
not            28  .11    in              8  .03    and             1  .00
call           22  .09    dollar          7  .03    relGE           1  .00
var            12  .05    relL            6  .02    fdollar         1  .00
relG           11  .04    relLE           3  .01








(59) ifexp[@]
f = 211 H = 2.315, p = .001, H contr. = .002

12 cases
relE          102  .48    relN            6  .03    dollar          4  .02
var            54  .26    relL            6  .03    or              3  .01
in             14  .07    dot             6  .03    not             3  .01
relG            8  .04    and             4  .02    relGE           1  .00



(60) ifexp[*,@]
f = 211 H = 2.433, p = .001, H contr. = .002

18 cases (11 shown)
num           115  .55    dollar          9  .04    str             3  .01
dot            25  .12    ifexp           6  .03    index           2  .01
call           18  .09    plus            3  .01    memory          2  .01
var            18  .09    minus           3  .01    <<others>>      7  .03



(61) ifexp[*,*,@]
f = 211 H = 3.005, p = .001, H contr. = .002

21 cases
num            95  .45    plus            6  .03    uminus          2  .01
dot            20  .09    str             4  .02    addr            2  .01
var            20  .09    caseexp         3  .01    relN            1  .00
call           14  .07    index           3  .01    relL            1  .00
ifexp          13  .06    dindex          3  .01    times           1  .00
dollar          9  .04    min             3  .01    inlinecall      1  .00
minus           7  .03    not             2  .01    constructx      1  .00



(62) fextract[@]
f = 196 H = 0.000, p = .001, H contr. = 0.000

list          196 1.00



(63) fextract[*,@]
f = 196 H = .672, p = .001, H contr. = .000

call          166  .85    inlinecall     28  .14    signal          2  .01

(64) signal[@]
f = 186 H = .524, p = .001, H contr. = .000

var           164  .88    dot            22  .12



(65) signal[*,@]
f = 186 H = 1.602, p = .001, H contr. = .001

5 cases
<empty>       104  .56    var            31  .17    num             2  .01
list           44  .24    dollar          5  .03



(66) signal[*,*,@]
f = 186 H = 0.000, p = .001, H contr. = 0.000

<empty>       186 1.00



(67) intCC[@]
f = 183 H = .483, p = .001, H contr. = .000

4 cases
num           168  .92    dot             2  .01
var            12  .07    minus           1  .01


(68) intCC[*,@]
f = 183 H = 1.466, p = .001, H contr. = .001

7 cases
num           133  .73    plus            7  .04    dollar          1  .01
var            19  .10    div             7  .04
dot            10  .05    minus           6  .03

(69) construct[@]
f = 183 H = 1.817, p = .001, H contr. = .001

6 cases
var            90  .49    dot            26  .14    dollar          3  .02
uparrow        52  .28    dindex         10  .05    index           2  .01

(70) construct[*,@]
f = 183 H = 0.000, p = .001, H contr. = 0.000 

list 183 1.00 

(71) unionx[@]
f = 179 H = 0.000, p = .001, H contr. = 0.000 

var 179 1.00 

(72) unionx[*,@] 
f = 179 H = 0.000, p = .001, H contr. = 0.000 

list 179 1.00 

(73) upthru[@] 
f = 170 H = .767, p = .001, H contr. = .000 

var 132 .78 <empty> 38 .22 

(74) upthru[*,@]
f = 170 H = 1.220, p = .001, H contr. = .001

4 cases
intCC          83  .49    intOO           3  .02
intCO          81  .48    intOC           3  .02

(75) relG[@]
f = 159 H = 2.138, p = .001, H contr. = .001

12 cases
var            94  .59    assignx         8  .05    times           1  .01
dot            21  .13    plus            7  .04    inlinecall      1  .01
dollar         10  .06    length          5  .03    abs             1  .01
<empty>         8  .05    index           2  .01    bumpx           1  .01

(76) relG[*,@]
f = 159 H = 1.386, p = .001, H contr. = .001

9 cases
num           120  .75    dollar          6  .04    ifexp           1  .01
var            14  .09    index           3  .02    mod             1  .01
dot            11  .07    minus           2  .01    max             1  .01

(77) lbl[@]
f = 159 H = .984, p = .001, H contr. = .001

5 cases
1             127  .80    3               4  .03    5               2  .01
2              23  .14    4               3  .02

(78) catchphrase[@]
f = 130 H = 1.040, p = .000, H contr. = .000

item           98  .75    <empty>        20  .15    list           12  .09



(79) catchphrase[*,@]
f = 130 H = 1.024, p = .000, H contr. = .000

8 cases
<empty>       109  .84    continue        4  .03    assign          1  .01
goto            8  .06    list            2  .02    error           1  .01
inlinecall      4  .03    call            1  .01

(80) error[@]
f = 129 H = .065, p = .000, H contr. = .000

var           128  .99    dot             1  .01

(81) error[*,@]
f = 129 H = 1.903, p = .000, H contr. = .001

9 cases
<empty>        59  .46    num             6  .05    ifexp           1  .01
var            47  .36    dot             3  .02    plus            1  .01
list            9  .07    str             2  .02    addr            1  .01

(82) error[*,*,@]
f = 129 H = .065, p = .000, H contr. = .000

<empty>       128  .99    catchphrase     1  .01

(83) or[@]
f = 127 H = 2.955, p = .000, H contr. = .001

11 cases
relE           38  .30    relG            8  .06    dot             5  .04
relN           21  .17    var             8  .06    call            2  .02
not            20  .16    and             7  .06    assignx         1  .01
or             10  .08    relL            7  .06

(84) or[*,@]
f = 127 H = 2.856, p = .000, H contr. = .001

14 cases
relE           43  .34    relG            5  .04    dollar          2  .02
and            25  .20    dot             4  .03    relGE           1  .01
relN           21  .17    call            3  .02    relLE           1  .01
not            10  .08    caseexp         2  .02    in              1  .01
var             7  .06    relL            2  .02








(85) goto[@]
f = 105 H = 0.000, p = .000, H contr. = .000
lbl 105 1.00

(86) in[@]
f = 102 H = 1.492, p = .000, H contr. = .001

6 cases
var 57 .56 minus 5 .05 plus 1 .01
<empty> 35 .34 dollar 3 .03 call 1 .01

(87) in[*,@]
f = 102 H = .323, p = .000, H contr. = .000
intCC 96 .94 intCO 6 .06

(88) intCO[@]
f = 94 H = .849, p = .000, H contr. = .000

4 cases
num 78 .83 dot 4 .04
var 11 .12 length 1 .01

(89) intCO[*,@]
f = 94 H = 2.573, p = .000, H contr. = .001

9 cases
var 38 .40 plus 10 .11 min 3 .03
length 15 .16 dollar 7 .07 div 2 .02
dot 12 .13 minus 5 .05 call 2 .02

(90) relL[@]
f = 93 H = 2.197, p = .000, H contr. = .001

9 cases
var 51 .55 assignx 9 .10 call 2 .02
dollar 10 .11 dot 6 .06 index 2 .02
<empty> 9 .10 plus 3 .03 div 1 .01



(91) relL[*,@]
f = 93 H = 2.141, p = .000, H contr. = .001

9 cases
num 46 .49 dollar 6 .06 times 1 .01
var 20 .22 length 4 .04 index 1 .01
dot 12 .13 div 2 .02 min 1 .01

(92) openstmt[@]
f = 91 H = 0.000, p = .000, H contr. = 0.000
<empty> 91 1.00

(93) openstmt[*,@]
f = 91 H = 1.052, p = .000, H contr. = .000

8 cases
list 76 .84 casestmt 2 .02 ifstmt 1 .01
enable 5 .05 inlinecall 1 .01 dostmt 1 .01
label 4 .04 assign 1 .01

(94) div[@]
f = 87 H = 2.183, p = .000, H contr. = .001

7 cases
var 37 .43 dollar 6 .07 minus 1 .01
plus 23 .26 call 6 .07
dot 11 .13 times 3 .03

(95) div[*,@]
f = 87 H = .572, p = .000, H contr. = .000

4 cases
num 79 .91 dot 3 .03
var 4 .05 times 1 .01

(96) register[@]
f = 87 H = 0.000, p = .000, H contr. = 0.000
num 87 1.00

(97) caseexp[@]
f = 84 H = 1.343, p = .000, H contr. = .000

7 cases
dollar 62 .74 call 2 .02 num 1 .01
var 12 .14 inlinecall 2 .02
dot 4 .05 dindex 1 .01

(98) caseexp[*,@]
f = 84 H = 0.000, p = .000, H contr. = 0.000
list 84 1.00

(99) caseexp[*,*,@]
f = 84 H = .885, p = .000, H contr. = .000

7 cases
num 73 .87 dot 2 .02 var 1 .01
call 3 .04 str 2 .02
ifexp 2 .02 dindex 1 .01

(100) forseq[@]
f = 84 H = 0.000, p = .000, H contr. = 0.000
var 84 1.00

(101) forseq[*,@]
f = 84 H = 2.496, p = .000, H contr. = .001

10 cases
var 30 .36 index 4 .05 minus 1 .01
dot 25 .30 num 4 .05 div 1 .01
call 10 .12 dindex 3 .04
dollar 4 .05 plus 2 .02

(102) forseq[*,*,@]
f = 84 H = 1.745, p = .000, H contr. = .000

5 cases
dot 49 .58 var 10 .12 minus 3 .04
call 14 .17 plus 8 .10

(103) seqindex[@]

f = 82 H = .592, p = .000, H contr. = .000
var 72 .88 dot 9 .11 dindex 1 .01

(104) seqindex[*,@]

f = 82 H = 1.736, p = .000, H contr. = .000

6 cases

var 52 .63 dot 7 .09 bumpx 4 .05

minus 10 .12 num 7 .09 plus 2 .02

(105) caseswitch[@]

f = 81 H = 1.175, p = .000, H contr. = .000
minus 45 .56 <empty> 33 .41 plus 3 .04

(106) caseswitch[*,@]

f = 81 H = 0.000, p = .000, H contr. = 0.000
num 81 1.00

(107) caseswitch[*,*,@]

f = 81 H = 0.000, p = .000, H contr. = 0.000
list 81 1.00

(108) arraydesc[@]

f = 80 H = 0.000, p = .000, H contr. = 0.000
list 80 1.00

(109) constructx[@]

f = 77 H = 0.000, p = .000, H contr. = 0.000
temp 77 1.00

(110) constructx[*,@]

f = 77 H = 0.000, p = .000, H contr. = 0.000
list 77 1.00

(111) length[@]

f = 50 H = .529, p = .000, H contr. = .000
var 44 .88 dot 6 .12

(112) row[@]

f = 47 H = 0.000, p = .000, H contr. = 0.000
list 47 1.00

(113) rowcons[@]

f = 47 H = 0.000, p = .000, H contr. = 0.000
var 47 1.00

(114) rowcons[*,@]

f = 47 H = 0.000, p = .000, H contr. = 0.000
row 47 1.00

(115) mwconst[@]

f = 44 H = 0.000, p = .000, H contr. = 0.000
list 44 1.00

(116) resume[@]

f = 44 H = .774, p = .000, H contr. = .000
4 cases

<empty> 38 .86 dot 2 .05

var 3 .07 list 1 .02

(117) relGE[@]

f = 41 H = 1.954, p = .000, H contr. = .000

7 cases

var 23 .56 dollar 3 .07 plus 1 .02

bumpx 8 .20 assignx 2 .05
dot 3 .07 <empty> 1 .02
(118) relGE[*,@]
f = 41 H = 2.022, p = .000, H contr. = .000

7 cases
num 18 .44 dot 2 .05 max 1 .02
var 13 .32 assignx 1 .02
length 5 .12 dollar 1 .02

(119) label[@]
f = 41 H = 1.883, p = .000, H contr. = .000

7 cases
list 25 .61 ifstmt 3 .07 assign 1 .02
enable 5 .12 catchmark 2 .05
casestmt 4 .10 call 1 .02

(120) label[*,@]
f = 41 H = .281, p = .000, H contr. = .000
item 39 .95 list 2 .05

(121) uminus[@]
f = 35 H = 1.391, p = .000, H contr. = .000

5 cases
var 25 .71 call 3 .09 dindex 1 .03
dollar 4 .11 dot 2 .06

(122) vconstruct[@]
f = 32 H = 0.000, p = .000, H contr. = 0.000
dollar 32 1.00

(123) vconstruct[*,@]
f = 32 H = 0.000, p = .000, H contr. = 0.000
list 32 1.00

(124) relLE[@]
f = 30 H = 1.826, p = .000, H contr. = .000

8 cases
var 20 .67 abs 2 .07 index 1 .03
<empty> 2 .07 dollar 1 .03 dot 1 .03
assignx 2 .07 times 1 .03

(125) relLE[*,@]
f = 30 H = 1.505, p = .000, H contr. = .000

5 cases
num 20 .67 dot 2 .07 index 2 .07
var 5 .17 dollar 1 .03

(126) enable[@]
f = 29 H = 0.000, p = .000, H contr. = 0.000
catchphrase 29 1.00

(127) enable[*,@]
f = 29 H = .431, p = .000, H contr. = .000
list 27 .93 ifstmt 1 .03 dostmt 1 .03

(128) mod[@]
f = 27 H = 2.203, p = .000, H contr. = .000

6 cases
var 12 .44 dot 3 .11 bumpx 3 .11
plus 5 .19 inlinecall 3 .11 dollar 1 .04

(129) mod[*,@]
f = 27 H = .979, p = .000, H contr. = .000
num 20 .74 length 6 .22 var 1 .04

(130) min[@]
f = 27 H = 0.000, p = .000, H contr. = 0.000
list 27 1.00

(131) stringinit[@]
f = 27 H = 0.000, p = .000, H contr. = 0.000
num 27 1.00
(132) max[@]

f = 25 H = 0.000, p = .000, H contr. = 0.000
list 25 1.00

(133) catchmark[@]

f = 24 H = 1.908, p = .000, H contr. = .000
5 cases

call 10 .42 enable 6 .25 ifstmt 1 .04

assign 6 .25 fextract 1 .04

(134) base[@]

f = 22 H = .439, p = .000, H contr. = .000
var 20 .91 dot 2 .09

(135) memory[@]

f = 13 H = 1.884, p = .000, H contr. = .000
4 cases

var 5 .38 plus 2 .15

num 4 .31 minus 2 .15

(136) fdollar[@]

f = 11 H = .845, p = .000, H contr. = .000
call 8 .73 inlinecall 3 .27

(137) fdollar[*,@]

f = 11 H = 0.000, p = .000, H contr. = 0.000
var 11 1.00

(138) downthru[@]

f = 11 H = 0.000, p = .000, H contr. = 0.000
var 11 1.00

(139) downthru[*,@]

f = 11 H = .946, p = .000, H contr. = .000
intCO 7 .64 intCC 4 .36

(140) dst[@]

f = 10 H = 0.000, p = .000, H contr. = 0.000
var 10 1.00

(141) lstf[@]

f = 10 H = 0.000, p = .000, H contr. = 0.000
var 10 1.00

(142) extract[@]

f = 9 H = 0.000, p = .000, H contr. = 0.000
list 9 1.00



(143) extract[*,@]

f = 9 H = 1.224, p = .000, H contr. = .000
call 6 .67 uparrow 2 .22 dollar 1 .11

(144) lst[@]

f = 9 H = 0.000, p = .000, H contr. = .000
var 9 1.00

(145) abs[@]

f = 8 H = 1.299, p = .000, H contr. = .000
var 5 .62 call 2 .25 dot 1 .12

(146) start[@]

f = 8 H = .811, p = .000, H contr. = .000
var 6 .75 dot 2 .25

(147) start[*,@]

f = 8 H = 0.000, p = .000, H contr. = 0.000
<empty> 8 1.00
(148) start[*,*,@]

f = 8 H = .544, p = .000, H contr. = .000
<empty> 7 .87 catchphrase 1 .12

(149) intOO[@]

f = 3 H = 0.000, p = .000, H contr. = 0.000
var 3 1.00

(150) intOO[*,@]

f = 3 H = .918, p = .000, H contr. = .000
var 2 .67 plus 1 .33

(151) intOC[@]

f = 3 H = 0.000, p = .000, H contr. = 0.000
var 3 1.00

(152) intOC[*,@]

f = 3 H = 0.000, p = .000, H contr. = 0.000
var 3 1.00

(153) stop[@]

f = 3 H = 0.000, p = .000, H contr. = 0.000
<empty> 3 1.00

(154) svc[@]

f = 2 H = 0.000, p = .000, H contr. = 0.000
num 2 1.00

(155) var[global,@]

f = 5332 H = 3.411, p = .018, H contr. = .062
SEE ALSO: f H p contr
(213) 752 4.743 .003 .012, assign[var[global,@]]
(214) 4016 4.241 .014 .058, call[var[global,@]]
(215) 1009 2.251 .003 .008, dot[var[global,@]]
(216) 944 4.912 .003 .016, dot[*,var[global,@]]
(217) 164 3.938 .001 .002, signal[var[global,@]]
134 cases (12 shown)

2396 .45 5 149 .03 44 31 .01
1 925 .17 6 111 .02 12 27 .01
2 361 .07 7 75 .01 <<others>> 722 .14
3 236 .04 40 59 .01
4 198 .04 -10 42 .01



(156) var[loca 

f = 2335 H = 

SEE ALSO: f 

(196) 3296 

(197) 610 

(198) 1445 

(199) 1929 



(201) 
(202) 
(203) 
(204) 
(205) 
(206) 
(207) 
(208) 
(209) 
(210) 
(211) 
(212) 
z/ cas 

1 
2 
3 
4 
5 



84 

1909 

199 

90 

321 

3027 

162 

264 

245 

853 

191 

355 

129 

es (15 



i.e] 

3.153, 
H 
2.995 
2.949 
2.275 
1.384 

2.798 
2.803 
2.595 
2.684 



795 
598 
453 
364 
607 
3.006 
1.130 
2.023 
shown 
580 
503 
379 
217 
160 
127 



P = 
P 

.011 
.002 
.005 
.007 

nn A 
. U\J\J 

.006 
.001 
.000 
.001 
.010 
.001 
.001 
.001 
.003 
.001 
.001 
.000 

I 

25 
22 
16 
09 
07 
05 



.008, 
contr 
. 033 , 
.006, 
.011, 
.009, 
. 000 , 
.018, 
.002, 
.001, 
.003, 
.018, 
.001, 
.003, 
.002, 
.008, 
.002, 
.001, 
.001, 

7 

6 

8 

9 

14 

13 



H contr. 



.025 



{assign|assignx}[var[loca1 ,0]] 

{assign|assignx}[*,var[local ,@]] 

can[*,var[local ,0]] 

dot[var[local ,0]] 

fopseq[var[local ,@jj 

list[*. . . .var[local ,0]] 

minus[var[local ,0]] 

minus[^ ,var[ local ,0]] 

plus[var[local ,0]] 

plus[*,var[local ,0]] 

times[var[local ,0]] 

{index|dindex|seqindex}[var[local ,03] 

{indexjdindex j seqindex}[*, var [local ,0]] 

{relL|relLE|relE|relN|relGE|relG}[var[local,0]] 

{relL|relLE|relE|relN|re1GE|relG)[*,var[local,0]] 

uparrow[varilocal ,0]] 

upthru[var[local ,0]] 



16 


.05 


63 


.03 


40 


.02 


35 


.01 


19 


.01 


16 


.01 



12 


14 


.01 


11 


13 


.01 


10 


12 


.01 


<<others» 


41 


.02 



(157) var[field,e] 
f = 428 H = 1.491, 
SEE ALSO: f H 

(242) 2709 2.433 

(243) 4570 3.441 
6 cases 

5 193 .45 

179 .42 



= .001, H contr. = .002 
p contr 
009 .022, donarC*,var[fielti,9]] 
015 .053, dotC*.var[field.e]] 



52 
2 



.12 
.00 



00 
00 



(158) var[entry.9] 
f = 2686 H = 4.039, p 
54 cases (24 shown) 



.009, H contr. 



,037 



1 


651 


.24 


10 


67 


.02 


20 


29 


.01 


2 


350 


.13 


11 


57 


.02 


21 


29 


.01 


3 


267 


.10 


14 


54 


.02 


18 


24 


.01 


4 


190 


.07 


13 


53 


.02 


23 


22 


.01 


5 


155 


.06 


12 


52 


.02 


22 


19 


.01 


6 


130 


.05 


16 


39 


.01 


24 


14 


.01 


7 


111 


.04 


15 


34 


.01 


«others» 


98 


.04 


9 


91 


.03 


17 


32 


.01 








8 


88 


.03 


19 


30 


.01 









(159) var[global, ♦,♦,©] 

f = 12217 H = 1.114, p 

43 cases (11 shown) 



,041, H contr. = 



046 



16 


10530 .86 


11 


60 .00 


1584 


48 


.00 


14 




362 .03 


12 


59 .00 


64 


44 


.00 


32 




354 .03 


3 


54 .00 


15 


42 


.00 


1 




137 .02 


144 


48 .00 


«others» 


379 


.03 


(160) var| 


[local, ♦,♦,©] 












f = 6685 


H = 


2.327, p = 


.023, 


H contr. = .053 








SEE ALSO": 


f 


H p 


contr 










(229) 


3114 


2.078 .011 


.022, 


assign[var[local ,♦,♦ 


.0]] 






(230) 


590 


2.026 .002 


.004, 


as sign[*,var[ local ,* 


.♦.8]] 






(231) 


182 


1.963 .001 


.001, 


as signx[var[ local ,♦, 


♦.0]] 






(232) 


1445 


1.463 .005 


.007, 


call[*,var[local ,♦,* 


.0]] 






(233) 


840 


2.252 .003 


.006, 


dollar[var[local ,♦,* 


.0]] 






(234) 


1929 


0.000 .007 


0.000, 


dot[var[local,*,*,@]] 






(235) 


45 


0.000 .000 


0.000, 


error[*,var[ local ,♦, 


».0]] 






(236) 


113 


0.000 .000 


0.000, 


ifstmt|;var[local ,♦,♦ 


.0]] 






(237) 


1909 


2.104 .006 


.014, 


1 ist[*, . . . var[local , 


•.♦.0]] 






(238) 


58 


0.000 .000 


0.000, 


seqindex[var[local ,* 


.♦.0]] 






(239) 


50 


.242 .000 


.000, 


seqindex[*,var[ local 


.*,*.0]] 






(240) 


355 


0.000 .001 


0.000, 


uparrow[var[local ,*, 


♦.0]] 






(241) 


129 


1.400 .000 


.001, 


upthru[var[local ,*,* 


.0]] 






34 cases (12 


shown) 












14 


3135 .47 


8 


95 .01 


3 


53 


.01 


16 


2168 .32 


11 


94 .01 


176 


39 


.01 


1 




231 .03 


48 


69 .01 


<<others» 


269 


.04 


15 




210 .03 


4 


65 .01 








32 




199 .03 


9 


58 .01 








(161) var| 


[field, ♦,*,@] 












f = 428 


H = 


1.749, p = 


.001, 


H contr. = .003 








SEE ALSO: 


f 


H P 


contr 










(267) 


4570 


2.548 .015 


.039. 


dot[*,var[field,*,*,( 


9]] 






(268) 


2709 


3.287 .009 


.030, 


dollar[*,var[field,* 


.*.0]] 






7 cases 



















179 .42 


1 


5 .01 


14 


1 


.00 


16 




145 .34 


3 


5 .01 








32 




90 .21 


15 


3 .01 








(162) var| 


'entry,*,*, 6] 












f = 2686 


H = 


0.000. p = 


.009, 


H contr. = 0.000 









(163) dot[*,var[*,*.@]] 
f = 5697 H = .657, p 



019, H contr. = 



013 



15 cases 



5256 
74 
62 
53 
50 



.92 
.01 
.01 
.01 
.01 



(164) donar[*,var[*,*,0]] 
f = 2754 H = 1.713, p = 

15 cases 

1904 .69 

2 364 .13 

3 186 .07 

1 56 .02 
14 53 .02 

(165) unionx[var[*,*,§]} 
f =• 179 H = 2.818, p = 

16 cases 

1 45 . 25 

3 43 .24 
32 .18 

2 32 .18 

4 4 .02 
14 4 . 02 

(166) addr[var[0]] 

f = 339 H = .958. p = 
global 210 .62 

(167) assign[var[e]] 

f = 3885 H = .752, p = 
local 3114 .80 

(168) ass1gn[*,var[8]] 

f = 775 H = .959, p = 
4 cases 
local 590 .76 
global 155 .20 

(169) assignx[var[@]3 

f = 223 H = .688, p = 
local 182 .82 

(170) assignx[»,var[@]] 

f = 30 H = 1.159, p = 
local 20 .67 

(171) call[var[§]] 

f = 6370 H = .992, p = 
global 4016 .63 

(172) call[*,var[e]] 

f = 1739 H = .806, p = 
4 cases 
local 1445 .83 
global 238 .14 

(173) dot[var[@]] 

f = 2941 H = .939, p = 
local 1929 .66 

(174) dot[*.varr0]] 

f = 5697 H = '.855, p = 
4 cases 
field 4570 .80 
global 944 .17 

(175) dollar[var[@]] 

f = 964 H = .682. p = 



14 47 

4 36 
2 35 
9 24 

5 23 



009, H contr. = 

12 50 

4 31 

5 28 
8 21 
15 21 



,001, H contr. =» 

5 3 

7 2 

8 2 

9 2 

10 2 

11 2 



.001, H contr. 
local 



.013, H contr. 
global 



129 



752 



003, H contr. » 

entry 28 
field 2 



001, H contr. 
global 



,000, H contr, 
global 



022, H contr. 
entry 



.006, H contr. 

field 
entry 



010, H contr. 
global 



019, H contr, 

entry 
local 



003, H contr, 



41 



Z3iy 



33 
23 



.01 
.01 
.OX 
.00 
.00 



016 

.02 
.01 
.01 
.01 
.01 



002 

.02 
.01 
.01 
.01 
.01 
.01 



,001 
.38 



,010 
.19 



,003 

.04 
.00 



001, 
.18 



000 
.27 



,021 
.36 



005 

.02 
.01 



.009 
1009 .34 



.017 

171 .03 
12 .00 



10 

15 

7 

11 

12 



6 

11 

7 

13 

9 



12 
13 
6 
16 



field 



entry 



local 



field 



11 


.00 


11 


.00 


9 


.00 


5 


.00 


1 


.00 



12 


.00 


11 


.00 


9 


.00 


6 


.00 


2 


.00 



2 


.01 


2 


.01 


1 


.01 


1 


.01 



19 



3& 



00 



07 



,U1 



,00 



,002 
local



840 



87 



field 



63 



.07 



global 



61 



06 



(176) clollar[»,var[@]] 

f = 2754 H = .126, p = 
field 2709 .98 

(177) error[var|[@]] 

f = 128 H = 0.000, p = 
global 128 1.00 

(178) signal[var[@]] 

f = 164 H = 0.000, p = 
global 164 1.00 



009, H contr, 
global 



000, H contr. 



42 



,001 
.02 



0.000 



001, H contr. = 0.000 



local 



00 



(179) {relE|relG|relN|relL|relGE|relLE}[var[@]]
f = 955 H = .549, p = .003, H contr. = .002
local 853 .89 global 89 .09 field 13 .01
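
The p and H contr. columns appear to follow from f and H: p looks like the share of all tree nodes that match the pattern, and H contr. looks like p times H. The sketch below assumes that relation. It also assumes a hypothetical total of 295,000 nodes; that figure is not stated in this section and was chosen only because it is consistent with the rounded p values printed here. The names `TOTAL_NODES_ASSUMED` and `contribution` are illustrative.

```python
# Hypothetical total node count: an assumption made for this sketch,
# not a figure taken from the text.
TOTAL_NODES_ASSUMED = 295_000


def contribution(f, h, total=TOTAL_NODES_ASSUMED):
    """Return (p, H contr.), assuming p = f / total and H contr. = p * H."""
    p = f / total
    return p, p * h


# Entry (179): f = 955, H = .549
p, h_contr = contribution(955, 0.549)
print(f"p = {p:.3f}, H contr. = {h_contr:.3f}")
```

Under these assumptions the result rounds to the listed p = .003 and H contr. = .002 for entry (179). Because the printed p is rounded to three places, multiplying the printed p by H directly can differ from the printed H contr. in the last digit.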



(180) {relE|relG|relN|relL|relGE|relLE}[*,var[@]] 



403 H = .998, p = .001, H contr. = 



global 



212 



53 



local 



191 



001 
,47 



(181) return[var[@]] 

f = 337 H = .692, p = 
local 292 .87 

(182) uparrow[var[@]] 

f = 3 72 H = .268, p = 
local 355 .95 

(183) upthru[var[@]] 

f = 132 H = .156, p = 
local 129 .98 



.001, H contr, 
global 



001, H contr. 
global 



.000. H contr. 
global 



29 



17 



,001 
.09 



000 
.05 



.000 
3 .02 



field 



16 



05 



( 184) {index|dindex|seqindex}[var[@]] 

f = 757 H = 1.317, p = .003, H contr, 
global 423 .56 local 

( 185) {index|dindex|seqindex}[*,var[@]] 

f = 289 H = .615, p = .001, H contr. 
local 245 .85 



(186) assign[*,num[@]] 

f = 1078 H = 3.056, p = 

79 cases (11 shown) 

431 .40 

1 282 .26 
-1 81 .08 
16383 58 .05 



264 



003 
.35 



001 



global 




44 


.15 


.004, H 


contr. 


s 


.011 


65535 




51 


.05 


2 




26 


.02 


17 




11 


.01 


32767 




11 


.01 



field 



70 



,09 



3 


10 


.01 


16 


8 


.01 


4 


5 


.00 


<<otliers>> 


104 


.10 



(187) index[«,num[@]] 

f = 85 H = 1.606, p = 

7 cases 

54 .64 

1 17 .20 

2 8 .09 



.000, H contr. 



000 



-65536 


2 


.02 


-65535 


2 


.02 


8 


1 


.01 



1788 



01 



(188) 


intCC[num[@]] 




f = 


168 H = 


2.569 


. P 


19 


cases 











79 


.47 


1 




42 


.25 


48 




8 


.05 


2 




6 


.04 


97 




6 


.04 


-505 




4 


.02 


65 




4 


.02 



001, H contr. = 

18 

-507 

-506 

-325 

3 

34 

-327 



001 



3 


.02 


-326 


2 


.01 


7 


2 


.01 


8 


2 


.01 


24 


2 


.01 


65408 


2 


.01 




1 


.01 





01 
01 
01 
01 
01 



(189) intCO[num[@]] 
f = 78 H = .477, p 



.000, H contr. 



000 
70 .90 1 8 .10



(190) 
f = 


minus[*,num[@]] 
318 H = 2.446. p = 


.001, 


H 


contr. 


_ 


.003 








26 


cases (17 shown) 


















1 


195 .61 


4 






6 


.02 


16 


3 


.01 


2 


34 .11 


48 






6 


.02 


35 


3 


.01 


3 


12 .04 


107 






5 


.02 


112 


3 


.01 


63 


10 .03 


5 






4 


.01 


18 


2 


.01 


6 


8 .03 


32 






4 


.01 


61 


2 


.01 


7 


8 .03 


65535 




4 


.01 


<<others» 


9 


.03 


(191) 
f = 


plus[*,num[@]] 

316 H = 2.807, p = 


.001, 


H 


contr. 


. 


.003 








26 


cases (19 shown) 


















1 


147 .47 


6 






5 


.02 


31 


2 


.01 


2 


58 .18 


16 






4 


.01 


48 


2 


.01 


3 


31 .10 


32 






4 


.01 


119 


2 


.01 


4 


15 .05 


20 






3 


.01 


259 


2 


.01 


15 


10 .03 


64 






3 


.01 


512 


2 


.01 


5 


8 .03 


256 






3 


.01 


«others» 


7 


.02 


255 


6 .02 


19 






2 


.01 








(192) 
f = 


tinies[* ,num[@]] 
292 H = 2.387, p = 


.001, 


H 


contr. 


s 


.002 








17 


cases 


















2 


133 .46 


8 






4 


.01 


11 


2 


.01 


3 


71 .24 


12 






4 


.01 


6 


1 


.00 


16 


42 .14 


15 






4 


.01 


9 


1 


.00 


256 


10 .03 


1 






3 


.01 


20 


1 


.00 


127 


5 .02 


10 






3 


.01 


64 


1 


.00 


4 


4 .01 


512 






3 


.01 








(193) 
f = 


register[num[@]] 
87 H = 2.221, p = 


.000. 


tt 


contr. 


s 


.001 








5 


cases 


















253 


29 .33 


2 






17 


.20 


1 


10 


.11 


3 


19 .22 


254 






12 


.14 









(194) {relL|relLE|relE|relN|re1GE|relG}[*,num[@]] 



f = 2610 H = 4.744, p = 
152 cases (25 shown) 

625 .24 

1 254 .10 
16383 245 .09 
3 197 .08 
-1 164 .06 

2 147 .06 
65535 93 .04 
7 49 .02 
13 45 .02 

(195) return[num[@]] 



009, H contr. = .042 



4 

16 

5 

10 

12 

9 

14 

8 

15 



f = 148 H = 
9 cases 

1 
32767 



2.227, p 

65 .44 
34 .23 
25 .17 



( 196) {assign | ass ignx}[var[ local ,@]] 

f = 3296 H = 2.995, p = .011, H contr, 

27 cases (12 shown) 

868 .26 5 

1 762 .23 6 

2 500 .15 7 

3 386 .12 8 

4 226 .07 9 

(197) {assign|assignx}[*,var[local ,@]] 

f = 610 H = 2.949, p = .002, H contr. 



39 
35 
28 
28 
27 
25 
25 
21 
21 



.001. H contr. = 

16383 9 

-1 7 

65535 3 



170 

107 

79 

48 

34 



.01 
.01 
.01 
.01 
.01 
.01 
.01 
.01 
.01 



001 

.06 
.05 
.02 



033 

.05 
.03 
.02 
.01 
.01 



006 



32767 

2047 

11 

32 

66 

6 

63 

<<others>> 



2047 

49151 

2 



19 
17 
15 
15 
15 
14 
14 
433 



,01 
,01 
,01 
,01 
,01 

01 
,01 

17 



2 


.01 


2 


.01 


1 


.01 



10 


21 


.01 


11 


17 


.01 


<<others>> 


78 


.02 
22 cases (11 shown)






186 .30 


4 






50 


.08 


8 


7 


.01 


1 


122 .20 


5 






37 


.06 


9 


5 


.01 


2 


88 .14 


6 






25 


.04 


10 


4 


.01 


3 


59 .10 


7 






12 


.02 


<<others» 


15 


.02 


(198) 


can[*,var[local,6]] 


















f = 


1445 H = 2.275, p = 


.005, 


H 


contr. 


s 


.011 








18 


cases (11 shown) 





















647 .45 


4 






53 


.04 


10 


4 


.00 


1 


363 .25 


5 






38 


.03 


9 


3 


.00 


2 


175 .12 


7 






21 


.01 


11 


3 


.00 


3 


113 .08 


6 






15 


.01 


«others» 


10 


.01 


(199) 


dot£var[local ,0]] 


















f = 


1929 H = 1.384, p = 


.007, 


H 


contr. 


a 


.009 








13 


cases 





















1365 .71 


5 






8 


.00 


21 


2 


.00 


1 


340 .18 


6 






5 


.00 


7 


1 


.00 


2 


115 .06 


22 






3 


.00 


9 


1 


.00 


3 


63 .03 


8 






2 


.00 








4 


22 .01 


20 






2 


.00 








(200) 


forseqtvar[1ocal ,0]] 


















f = 


84 H = .963, p = 


.000, 


H 


contr. 


= 


.000 








4 


cases 





















67 .80 


2 






5 


.06 








1 


11 .13 


6 






1 


.01 








(201) 


list[*, . . .var[ local ,0]] 
















f = 


1909 H = 2.798, p = 


.006, 


H 


contr. 


= 


.018 








23 


cases (11 shown) 





















578 .30 


4 






133 


.07 


8 


19 


.01 


1 


421 .22 


5 






85 


.04 


9 


14 


.01 


2 


313 .16 


6 






47 


.02 


10 


13 


.01 


3 


227 .12 


7 






30 


.02 


<<others» 


29 


.02 


(202) 


minus[var[1oca1 ,0]] 


















f = 


199 H = 2.803, p = 


.001, 


H 


contr. 


= 


.002 








13 


cases 


















2 


49 .25 


6 






11 


.06 


9 


1 


.01 


1 


48 .24 


5 






5 


.03 


11 


1 


.01 





39 .20 


7 






4 


.02 


16 


1 


.01 


3 


24 .12 


8 






3 


.02 








4 


11 .06 


10 






2 


.01 








(203) 


mi nus[*,var{ local ,0]] 


















f = 


90 H = 2.595, p = 


. 000 , 


H 


contr. 


= ■ 


.001 








10 


cases 


















1 


33 .37 


5 






6 


.07 


7 


1 


.01 





15 .17 


9 






3 


.03 


8 


1 


.01 


2 


14 .16 


4 






2 


.02 








3 


13 .14 


6 






2 


.02 








(204) 


plus[var[local ,0]] 


















f = 


321 H = 2,684, p = 


.001. 


H 


contr. 


= 


.003 








15 


cases 





















102 .32 


4 






9 


.03 


16 


2 


.01 


1 


74 .23 


6 






9 


.03 


10 


1 


.00 


3 


54 .17 


7 






5 


.02 


14 


1 


.00 


2 


38 .12 


8 






2 


.01 


18 


1 


.00 


5 


20 .06 


9 






2 


.01 


23 


1 


.00 


(205) 


plus[*,var[local ,0]] 


















f = 


3027 H = 1.795, p = 


.010, 


H 


contr. 


s 


.018 








15 


cases 





















1852 .61 


5 






29 


.01 


17 


2 


.00 


1 


565 .19 


6 






27 


.01 


10 


1 


.00 


2 


278 .09 


7 






25 


.01 


12 


1 


.00 


3 


150 .05 


8 






5 


.00 


14 


1 


.00 
86



.03 



00 



23 



GO 



(206) times[var[1ocar,§]] 

f = 162 H = 2.598, p = 

11 cases 

66 .41 

1 30 . 19 
4 17 .10 

2 16 .10 



001, 


H contr. 


s 


.001 , 




3 




10 


.06 


10 


6 




8 


.05 


15 


7 




7 


.04 


17 


5 




5 


.03 





.01 
.01 
.01 



(207) {index|dindex|seqinclex}[var[Tocal,@]] 



f = 264 H = 3.453, p 
21 cases (13 shown) 



51 
37 
36 
29 
27 



.19 
.14 
.14 
.11 
.10 



.001, H contr. = .003 



5 

18 

3 

9 

11 



27 

12 

10 

8 

7 



.10 
.05 
.04 
.03 
.03 



16 

6 

28 

<<others» 



.03 
01 
.01 
.03 



(208) {index|dindex|seqindex}[*,var[local ,0]] 
f = 245 H = 2.364, p = .001, H contr. = 



12 cases 



95 
69 
35 
19 



.39 
.28 
.14 
,08 



10 
6 
3 
2 



.002 



.04 
.02 
.01 
.01 



11 
16 
7 
8 



.01 
.01 
.00 
,00 



(209) 
f = 



{re1L|relLE|relE|relN|relGE(rel6}[varC1ocal,03] 



853 H 



16 cases 



2.607, p = .003, H contr. 



313 

189 

134 

71 

57 

31 



.37 
.22 
.16 
.08 
.07 
.04 



6 

8 

7 

9 

10 

15 



20 

12 

10 

6 

3 

3 



008 



.02 
.01 
.01 
.01 
.00 
.00 



12 
13 
16 
20 



.00 
00 
.00 
.00 



(210) {relL|relLE|re1E|relN|relGE|relG}[*,var[local,®]] 
f = 191 H = 3.006, p = .001, H contr. = .002 
13 cases 



46 
33 
30 
29 
17 



.24 
.17 
.16 
.15 
.09 



6 
5 
7 
8 
10 



11 
8 
6 
6 
2 



.06 
.04 
.03 
.03 
.01 



11 
16 
68 



.01 
01 
.01 



(211) uparrow[var[local ,9]] 
f = 355 H = 1.130, p = .001, H contr. = 



8 cases 



281 
39 
17 



.79 
.11 
.05 



11 
3 
2 



.001 

.03 
.01 
.01 



.00 
.00 



(212) 

f = 

9 



1 

2 



upthru[var[loca1 ,6]] 
129 H = 2.023, p = 



cases 



67 
30 
15 



.52 
.23 
.12 



.000, H contr, 

3 
4 
6 



.001 

5 .04 
5 .04 
4 .03 



5 
9 
11 



01 
01 
01 



(213) assign[var[global,@]] 

f = 752 H = 4.743, p = .003, H contr. = .012 

95 cases (31 shown) 

1 100 .13 30 6 .01 28 
85 .11 45 6 .01 32 

3 79 .11 63 6 .01 33 

2 68 .09 23 5 .01 38 

4 61 .08 31 5 .01 40 

5 59 .08 35 5 .01 42 

6 41 .05 37 5 .01 43 





.01 




.01 




.01 




.01 




.01 




.01 




.01 
7




40 


.05 


41 




5 


.01 


44 


4 


.01 


27 




7 


.01 


64 




5 


.01 


53 


4 


.01 


34 




7 


.01 


16 




.4 


.01 


<<others» 


107 


.14 


24 




6 


.01 


17 




4 


.01 








(214) 


call[var[globa 


1.9]] 
















f = 


4016 H 


= 4.241 


. P = 


.014, 


H contr. 


s 


.058 








68 


cases i 


25 shown 


) 
















8 




916 


.23 


17 




101 


.03 


26 


40 


.01 


9 




536 


.13 


18 




93 


.02 


27 


35 


.01 


10 




380 


.09 


19 




82 


.02 


29 


32 


.01 


11 




285 


.07 


20 




75 


.02 


28 


31 


.01 


12 




227 


.06 


21 




68 


.02 


30 


28 


.01 


13 




188 


.05 


22 




58 


.01 


31 


24 


.01 


14 




147 


.04 


23 




50 


.01 


32 


24 


.01 


15 




128 


.03 


24 




45 


.01 


«others» 


269 


.07 


16 




113 


.03 


25 




41 


.01 








(215) 


dot[var[globa1 


.8]] 
















f = 


1009 H 


= 2.251 


. P = 


.003. 


H contr. 


s 


.008 








16 


cases 

























434 


.43 


7 




7 


.01 


65 


2 


.00 


1 




214 


.21 


6 




5 


.00 


26 


1 


.00 


2 




193 


.19 


59 




5 


.00 


32 


1 


.00 


3 




95 


.09 


22 




3 


.00 


62 


1 


.00 


4 




27 


.03 


25 




3 


.00 








5 




16 


.02 


29 




2 


.00 








(216) 


dot[* 


var[global ,@]] 
















f = 


944 H 


= 4.912 


. P = 


.003, 


H contr. 


= 


.016 








48 


cases 


40 shown 


) 
















3 




97 


.10 


27 




18 


.02 


2 


9 


.01 


14 




87 


.09 


29 




18 


.02 


21 


9 


.01 







60 


.06 


18 




17 


.02 


25 


8 


.01 


6 




51 


.05 


17 




16 


.02 


5 




.01 


12 




48 


.05 


28 




16 


.02 


8 




.01 


22 




48 


.05 


38 




16 


.02 


15 




.01 


1 




46 


.05 


26 




15 


.02 


36 




.01 


32 




45 


.05 


16 




14 


.01 


44 




.01 


4 




43 


.05 


23 




14 


.01 


41 


6 


.01 


7 




27 


.03 


19 




12 


.01 


33 


5 


.01 


24 




27 


.03 


20 




12 


.01 


42 


5 


.01 


31 




24 


.03 


13 




11 


.01 


43 


5 


.01 


30 




23 


.02 


10 




10 


.01 


<<others>> 


17 


.02 


45 




20 


.02 


11 




10 


.01 








(217) 


signal 


[var[g1obal,0]] 
















f = 


164 H 


= 3.938 


. P = 


.001, 


H contr. 


= 


.002 








39 


cases 

























      42  .26     15   1  .01      60   1  .01
  1   24  .15     17   1  .01      61   1  .01
  2   22  .13     20   1  .01      66   1  .01
  3   10  .06     21   1  .01      67   1  .01
  4    9  .05     26   1  .01      69   1  .01
  5    7  .04     29   1  .01      70   1  .01
  7    5  .03     31   1  .01      74   1  .01
 58    5  .03     34   1  .01     104   1  .01
  6    4  .02     40   1  .01     105   1  .01
 27    3  .02     46   1  .01     112   1  .01
542    3  .02     47   1  .01     118   1  .01
 16    2  .01     53   1  .01     553   1  .01
538    2  .01     54   1  .01     554   1  .01


(218) arraydesc[list[@]]
f = 80  H = 0.000, p = .000, H contr. = .000
2  80  1.00
















(219) call[*,list[@]]
f = 1816  H = 1.085, p = .006, H contr. = .007
4 cases
Appendix C: Detailed Pattern Data



C.6. Final Pattern Set — Sorted by Pattern Number 



1390 
263 



77 
14 



121 
42 



,07 
,02 



(220) 
f = 
12 
1 
2 
3 
6 

(221) 
f = 
10 
2 
3 
4 
5 

(222) 

f = 

2 

(223) 
f = 
12 
2 
3 
4 
6 

(224) 

f = 

8 

2 

3 

4 

(225) 

f = 

2 

(226) 

f = 

5- 

2 

3 

(227) 
f = 
12 
2 " 
3 
4 
5 

(228) 

f = 

2 

(229) 
f = 
22 
16 
14 
1 
15 

(230) 
f = 



casestmt[*,list[@]]
639 H = 1.208, p = 



cases 



478 

116 

18 

9 



.75 
.18 
.03 
.01 



dostmt[*,*,list[@]]
253 H = 2.208, p =
cases

111 .44
62 .25
40 .16
20 .08



.002, H contr. = .003

4 
5 
9 
10 



.001, H contr. = .002

6 

7 
8 
10 



7 


.01 


16 


4 


.01 


17 


2 


.00 


21 


1 


.00 


34 



fextract[list[@]]
196 H = 1.099, p =
136 .69 

ifstmt[*,list[@]]
598 H = 2.132, p = 
cases 

314 .53 

120 .20 

70 .12 

27 .06 

ifstmt[*,*,list[@]]
173 H = 2.051, p = 
cases 

77 .45 
50 .29 
20 .12 

inlinecall[*,list[@]]
230 H = .549, p =
203 .88 

item[list[@]]
97 H = 1.190, p =
cases 

74 .76 
12 .12 

item[*,list[@]] 



001. H 
1 



contr. 



7 


.03 


11 


4 


.02 


12 


3 


.01 




2 


.01 
.001 




9 


.25 


3 



.002, H contr. = .004

5 
7 
8 
9 



.001, H contr. = .001

5 
6 
10 



26 


.04 


11 


19 


.03 


12 


8 


.01 


16 


5 


.01 


10 



001. H 
3 



contP. = 



15 


.09 


8 


7 


.04 


9 


2 


.01 
.000 




26 


.11 


6 



,000. H contP. 

6 
5 



425 H 
:as8S 



2.096, p = .001, H contP. = 



215 
95 
52 
27 



51 
22 
12 
06 



signal[*,list[@]]
44 H = 0.000, p =
44 1.00 

assign[vap[local ,*. 
3114 H = 2.078. p = 
cases (11 shown) 

1719 .55 

726 .23 

284 .09 

73 .02 



,000. H contP. 



000 

.06 
.03 



003 



15 


.04 


10 


11 


.03 


12 


3 


.01 


14 


2 


.00 


42 



.6]] 
.011. 

8 

32 
11 
5 



H contP. = 



0.000 



022 



1 


.00 


1 


.00 


1 


.00 


1 


.00 



11 



01 
01 



,06 



4 


.01 


2 


.00 


2 


.00 


1 


.00 



01 
01 



,00 



02 



2 


.00 


1 


.00 


1 


.00 


1 


.00 



64 


.02 


3 


28 


.01 


64 


.02 


9 


19 


.01 


39 


.01 


7 


14 


.00 


30 


.01 


<<others>>


54 


.02 



assign[*,var[local,*,*,@]]
590 H = 2.026, p = .002, H contr. = .004



18 cases (12 shown)
16   340  .58     11   10  .02       4           3  .01
14   142  .24      3    9  .02     176           3  .01
 1    21  .04     32    9  .02     <<others>>   10  .02
 8    19  .03      5    4  .01
15    16  .03     10    4  .01









(231) assignx[var[local,*,*,@]]
f = 182  H = 1.963, p = .001, H contr. = .001
10 cases
16   99  .54      8   9  .05      9   1  .01
14   43  .24     11   3  .02     32   1  .01
 1   13  .07      3   1  .01
15   11  .06      5   1  .01



(232) call[*,var[local,*,*,@]]
f = 1445  H = 1.463, p = .005, H contr. = .007
11 cases
16 773 .53 ir
14 564 .39 15
8 47 .03 3
32 27 .02 6



14 


.01 


4 


7 


.00 


5 


4 


.00 


13 


4 


.00 





.00 
,00 
.00 



(233) dollar[var[local,*,*,@]]
f = 840  H = 2.252, p = .003, H contr. = .006
12 cases
 16   420  .50     304   41  .05      8   8  .01
 32   164  .20     208   19  .02     96   7  .01
176   109  .13     160   11  .01     64   5  .01
 48    43  .05     128    9  .01     80   4  .00



(234) dot[var[local,*,*,@]]
f = 1929  H = 0.000, p = .007, H contr. = 0.000
16  1929  1.00

(235) error[*,var[local,*,*,@]]
f = 45  H = 0.000, p = .000, H contr. = 0.000
16  45  1.00

(236) ifstmt[var[local,*,*,@]]
f = 113  H = 0.000, p = .000, H contr. = 0.000
1  113  1.00



(237) list[*,...var[local,*,*,@]]
f = 1909  H = 2.104, p = .006, H contr. = .014
17 cases (11 shown)
16   986  .52     32   51  .03     5            31  .02
14   542  .28      3   43  .02     8            17  .01
15    77  .04      9   38  .02     7             6  .00
 1    70  .04     11   32  .02     <<others>>   16  .01



(238) seqindex[var[local,*,*,@]]
f = 58  H = 0.000, p = .000, H contr. = 0.000
16  58  1.00



(239) seqindex[*,var[local,*,*,@]]
f = 50  H = .242, p = .000, H contr. = .000
16  48  .96     3  2  .04
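
The H value listed for a pattern can be checked from its case counts. The sketch below assumes that H is the Shannon entropy, in bits, of the empirical distribution of cases under the pattern; the function name entropy_bits is illustrative and is not taken from the analysis programs. Applied to pattern (239), with counts 48 and 2, it reproduces the listed .242.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (in bits) of the empirical distribution given by counts."""
    total = sum(counts)
    # Cases with zero count contribute nothing to the sum.
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Counts copied from the entry for pattern (239): 48 and 2 out of f = 50.
print(round(entropy_bits([48, 2]), 3))  # 0.242
```

A pattern with a single case, such as (238), gives an entropy of zero under the same function.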



(240) uparrow[var[local,*,*,@]]
f = 355  H = 0.000, p = .001, H contr. = 0.000
16  355  1.00



(241) upthru[var[local,*,*,@]]
f = 129  H = 1.400, p = .000, H contr. = .001
9 cases
16   99  .77     4   4  .03     8   2  .02
15   10  .08     9   4  .03     5   1  .01
 3    5  .04     6   3  .02     7   1  .01



(242) dollar[*,var[field,@]]
f = 2709  H = 2.433, p = .009, H contr. = .022
32 cases (11 shown)
     1432  .53     8   73  .03      4           47  .02
 1    370  .14     6   60  .02      5           40  .01
 2    343  .13     9   54  .02     10           23  .01
 3    157  .06     7   48  .02     <<others>>   62  .02



(243) dot[*,var[field,@]]
f = 4570  H = 3.441, p = .015, H contr. = .053
30 cases (17 shown)
     1032  .23      5   283  .06     26            36  .01
 3    701  .15      7   146  .03     14            28  .01
 4    597  .13      6   127  .03     17            27  .01
 1    588  .13     12    70  .02     19            24  .01
 2    390  .09     11    44  .01      9            23  .01
 8    290  .06     10    42  .01     <<others>>   122  .03



(244) arraydesc[list[*,...@]]
f = 160  H = 2.536, p = .001, H contr. = .001
10 cases
num    56  .35     plus       12  .08     dollar   1  .01
addr   39  .24     dot         7  .04     base     1  .01
var    21  .13     div         4  .02
call   17  .11     register    2  .01



(245) body[list[*,...@]]
f = 6617  H = 2.639, p = .022, H contr. = .059
27 cases (11 shown)
assign   2312  .35     casestmt    313  .05     rowcons      47  .01
call     1364  .21     dostmt      304  .05     openstmt     31  .00
return   1207  .18     construct    89  .01     bump         31  .00
ifstmt    744  .11     fextract     79  .01     <<others>>   96  .01



(246) call[*,list[*,...@]]
f = 4263  H = 2.425, p = .014, H contr. = .035
29 cases (11 shown)
var    1813  .43     dollar   144  .03     minus        32  .01
num    1216  .29     addr     112  .03     ifexp        24  .01
dot     447  .10     plus      84  .02     dindex       22  .01
call    226  .05     str       63  .01     <<others>>   80  .02


(247) caseexp[*,list[*,...@]]
f = 120  H = .495, p = .000, H contr. = .000
item  107  .89     caseswitch  13  .11








(248) casestmt[*,list[*,...@]]
f = 982  H = .363, p = .003, H contr. = .001
item  914  .93     caseswitch  68  .07
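
The H contr. figure appears to be the pattern's entropy weighted by its probability, p × H, so that the contributions of all patterns presumably add up to the overall estimate in bits per node. That reading is an assumption inferred from the listed numbers, not a definition stated in this appendix, and the helper name below is illustrative. Because p is printed to only three decimals, the product of the printed values can differ from the listed contribution in the last digit for some patterns.

```python
def entropy_contribution(p, h):
    """Assumed contribution of one pattern to the total entropy estimate: p * H."""
    return p * h

# Pattern (248): listed p = .003, H = .363, H contr. = .001
print(round(entropy_contribution(0.003, 0.363), 3))  # 0.001
```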








(249) casestmt[*,*,list[*,...@]]
f = 159  H = 2.228, p = .001, H contr. = .001
12 cases
call     71  .45     signal      5  .03     bump       2  .01
assign   49  .31     fextract    3  .02     casestmt   1  .01
ifstmt   12  .08     construct   3  .02     exit       1  .01
return    9  .06     dostmt      2  .01     error      1  .01



(250) caseswitch[*,*,list[*,...@]]
f = 572  H = 0.000, p = .002, H contr. = 0.000
casetest  572  1.00



(251) casetest[list[*,...@]]
f = 254  H = 0.000, p = .001, H contr. = 0.000
num  254  1.00

(252) casetest[*,list[*,...@]]
f = 600  H = 1.863, p = .002, H contr. = .004
16 cases
call  331  .55     fextract  5  .01



goto 



.00 
assign       169  .28
ifstmt        46  .08
vconstruct    12  .02
casestmt      12  .02
bump           9  .02



dostmt            .01
return            .01
exit              .00
inlinecall        .00
construct         .00
start             .00
stop              .00
1st               .00

(253) catchphrase[list[*,...@]]
f = 42  H = 0.000, p = .000, H contr. = 0.000
item  42  1.00



(254) {construct|constructx}[*,list[*,...@]]
f = 670  H = 2.948, p = .002, H contr. = .007
20 cases (13 shown)
var       180  .27     dollar       28  .04     index         6  .01
num       162  .24     call         15  .02     ifexp         5  .01
unionx    124  .19     addr         14  .02     times         5  .01
dot        55  .08     constructx   12  .02     <<others>>   13  .02
<empty>    42  .06     dindex        9  .01








(255) dostmt[*,*,list[*,...@]]
f = 828  H = 2.427, p = .003, H contr. = .007
14 cases
assign     340  .41     dostmt      22  .03     signal       4  .00
ifstmt     189  .23     fextract    15  .02     inlinecall   3  .00
call       135  .16     construct   10  .01     openstmt     2  .00
casestmt    55  .07     catchmark    6  .01     stop         1  .00
bump        42  .05     label        4  .00








(256) fextract[list[*,...@]]
f = 354  H = .663, p = .001, H contr. = .001
assign  293  .83     <empty>  61  .17








(257) inlinecall[*,list[*,...@]]
f = 489  H = 3.016, p = .002, H contr. = .005
17 cases
num          160  .33     dollar     16  .03     call     5  .01
var          114  .23     addr       16  .03     dindex   3  .01
dot           44  .09     minus      12  .02     uminus   2  .00
uparrow       31  .06     seqindex   12  .02     base     2  .00
inlinecall    30  .06     index       8  .02     ifexp    1  .00
plus          27  .06     times       6  .01









(258) item[list[*,...@]]
f = 243  H = .422, p = .001, H contr. = .000
5 cases
relE   228  .94     relL   2  .01     relG   1  .00
in      10  .04     Ibl    2  .01

(259) item[*,list[*,...@]]
f = 1355  H = 2.625, p = .005, H contr. = .012
21 cases (14 shown)
assign     542  .40     dostmt     28  .02     resume       15  .01
call       354  .26     exit       25  .02     continue     15  .01
ifstmt     187  .14     goto       17  .01     construct    13  .01
casestmt    47  .03     bump       16  .01     vconstruct   13  .01
return      45  .03     fextract   15  .01     <<others>>   23  .02


(260) label[list[*,...@]]
f = 187  H = 2.311, p = .001, H contr. = .001
12 cases
call     62  .33     fextract    5  .03     label       1  .01
assign   61  .33     bump        4  .02     exit        1  .01
ifstmt   38  .20     construct   3  .02     openstmt    1  .01
dostmt    7  .04     casestmt    3  .02     catchmark   1  .01



(261) mwconst[list[*,...@]]
f = 133  H = 0.000, p = .000, H contr. = 0.000
num  133  1.00

(262) openstmt[*,list[*,...@]]
f = 391  H = 2.113, p = .001, H contr. = .003
15 cases
assign     206  .53     fextract    6  .02     goto        2  .01
call        88  .23     return      4  .01     start       2  .01
ifstmt      47  .12     signal      4  .01     dst         1  .00
casestmt    15  .04     construct   3  .01     1st         1  .00
dostmt       8  .02     bump        3  .01     catchmark   1  .00

(263) return[list[*,...@]]
f = 160  H = 1.636, p = .001, H contr. = .001
9 cases
var      98  .61     dot     6  .04     div    1  .01
num      41  .26     call    3  .02     mod    1  .01
dollar    7  .04     index   2  .01     addr   1  .01



(264) row[list[*,...@]]
f = 686  H = .882, p = .002, H contr. = .002
num  480  .70     str  206  .30



(265) signal[*,list[*,...@]]
f = 88  H = 1.680, p = .000, H contr. = .001
5 cases
num   39  .44     addr     21  .24     call   1  .01
var   26  .30     dollar    1  .01



(266) vconstruct[*,list[*,...@]]
f = 32  H = 0.000, p = .000, H contr. = 0.000
unionx  32  1.00



(267) dot[*,var[field,*,*,@]]
f = 4570  H = 2.548, p = .015, H contr. = .039
27 cases (19 shown)
16   2647  .58     10   84  .02     128          44  .01
14    412  .09      2   75  .02      48          43  .01
64    302  .07      8   66  .01       9          38  .01
 1    271  .06     11   64  .01      80          26  .01
32    109  .02      3   51  .01       6          23  .01
 7    108  .02      4   44  .01     <<others>>   32  .01
15     87  .02      5   44  .01








(268) dollar[*,var[field,*,*,@]]
f = 2709  H = 3.287, p = .009, H contr. = .030
33 cases (18 shown)
16   792  .29       3   73  .03     80           24  .01
14   562  .21      15   70  .03     13           21  .01
 1   264  .10      12   52  .02      5           20  .01
 2   259  .10      11   41  .02      9           17  .01
 4   209  .08           32  .01     <<others>>   35  .01
32    97  .04      48   32  .01
 8    80  .03     128   29  .01








(269) bump[@]
f = 163  H = 1.027, p = .001, H contr. = .001
4 cases
var   126  .77     dollar    10  .06
dot    25  .15     uparrow    2  .01
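
A quick consistency check applies to any entry in this listing: the case counts should add up to f, and each listed probability should equal count/f rounded to two places. The following is a minimal sketch for pattern (269), with the entry transcribed by hand into a Python dictionary; the function and variable names are illustrative only.

```python
def check_entry(f, cases):
    """Return (counts sum to f, every probability matches count/f to two places)."""
    total = sum(count for count, _ in cases.values())
    # A two-decimal probability may differ from count/f by at most half a unit
    # in the last printed place.
    probs_ok = all(abs(count / f - prob) <= 0.005 + 1e-9
                   for count, prob in cases.values())
    return total == f, probs_ok

# Pattern (269): bump[@], f = 163
cases_269 = {"var": (126, 0.77), "dot": (25, 0.15),
             "dollar": (10, 0.06), "uparrow": (2, 0.01)}
print(check_entry(163, cases_269))  # (True, True)
```

The same check is useful when transcribing entries where a count or value is illegible, since the total f constrains what the missing counts can be.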








(270) 
f = 